7.3 Tokenizing

RWUTokenizer finds delimiters in source strings, and provides sequential access to the tokens between those delimiters.

An RWUTokenizer stores a deep copy of a string to be tokenized. You can pass the string in the constructor, as shown below:

The setText() method assigns a new string to a tokenizer, like so:

The getText() method returns a copy of the current string associated with a tokenizer.

You can also construct an empty RWUTokenizer, but it yields no tokens until the setText() method assigns it a string. For example:

Delimiter characters are a user-defined set of characters used to separate the tokens, or fields, in a string. Consider the following string:

Using the set of delimiter characters consisting of only a comma, you could break the string into the following three tokens:

RWUTokenizer provides methods for extracting each token from a string in sequence, with a set of delimiters specified on each token request.
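
The idea of supplying delimiters with each token request can be sketched with the standard library alone. This is a minimal illustration, not the RWUTokenizer API: the class SketchTokenizer, its next() method, the function demoPerRequestDelimiters(), and the sample string are all hypothetical names chosen for this example, and it works on plain std::string bytes rather than Unicode code points.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a tokenizer; this is NOT the RWUTokenizer API.
class SketchTokenizer {
public:
    explicit SketchTokenizer(std::string text)
        : text_(std::move(text)), pos_(0), done_(false) {}

    // Returns the text up to the next character found in `delims`,
    // then advances past that single delimiter.
    std::string next(const std::string& delims) {
        if (done_) {
            return std::string();
        }
        const std::size_t end = text_.find_first_of(delims, pos_);
        if (end == std::string::npos) {
            // No delimiter left: the remainder is the final token.
            done_ = true;
            return text_.substr(pos_);
        }
        std::string token = text_.substr(pos_, end - pos_);
        pos_ = end + 1;
        return token;
    }

    bool done() const { return done_; }

private:
    std::string text_;   // private copy of the text being tokenized
    std::size_t pos_;    // start of the next token
    bool done_;          // true once the final token has been returned
};

// Each request names its own delimiter set.
std::vector<std::string> demoPerRequestDelimiters() {
    SketchTokenizer tok("name=value;rest");  // assumed sample input
    std::vector<std::string> out;
    out.push_back(tok.next("="));  // yields "name"
    out.push_back(tok.next(";"));  // yields "value"
    out.push_back(tok.next(","));  // no comma remains, yields "rest"
    return out;
}
```

Because the delimiter set is an argument of each call rather than fixed at construction, a single pass can split one record on several different separators.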

Delimiters can be specified in a variety of ways. If no delimiters are specified, then the next token is extracted using a predefined set of delimiter characters. This set consists of the following: 0x0009 (horizontal tab), 0x000A (line feed), 0x000C (form feed), 0x000D (carriage return), 0x0020 (space), 0x0085 (next line), 0x2028 (line separator), 0x2029 (paragraph separator), and 0x0000 (null).
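
For reference, the predefined set listed above can be written as a small lookup table. The names kDefaultDelimiters and isDefaultDelimiter() are hypothetical and exist only in this sketch; only the nine code point values come from the text.

```cpp
#include <algorithm>
#include <iterator>

// The nine default delimiter code points listed in the text above.
// kDefaultDelimiters is a name assumed for this sketch.
static const char32_t kDefaultDelimiters[] = {
    0x0009,  // horizontal tab
    0x000A,  // line feed
    0x000C,  // form feed
    0x000D,  // carriage return
    0x0020,  // space
    0x0085,  // next line
    0x2028,  // line separator
    0x2029,  // paragraph separator
    0x0000   // null
};

// Returns true if `cp` is one of the default delimiter code points.
bool isDefaultDelimiter(char32_t cp) {
    return std::find(std::begin(kDefaultDelimiters),
                     std::end(kDefaultDelimiters),
                     cp) != std::end(kDefaultDelimiters);
}
```

Note that the listed set does not include 0x000B (vertical tab), so a predicate built from it treats that character as ordinary token content.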

Alternatively, you can specify an RWUString composed of a set of delimiter characters. Each code point in the delimiter argument is taken as a possible delimiter character. A slight variation on this technique allows you to specify that only the first N code units in the delimiter argument are considered as potential delimiters.

Finally, you can specify the delimiter argument as an RWURegularExpression. This technique allows for the specification of complex, multicharacter delimiters. Whereas the techniques above match only single-character (single code point) delimiters, a regular expression can match one delimiter that spans several code points.

Note that the static method RWUCharTraits::getWhitespace() returns a null-terminated array of whitespace code points, as a convenience for use as delimiters.

RWUTokenizer provides two variations on the tokenizing interface: the function call operator and the nextToken() method.

Each interface has overloads that allow you to tokenize using the default delimiters, a set of delimiters specified as an RWUString, a set of delimiters specified as the first N code units of an RWUString, or a set of delimiters specified as an RWURegularExpression (see Section 7.3.2).

In the tradition of RWCTokenizer, the function call operator interface scans a string for all occurrences of tokens, consuming all consecutive occurrences of the specified set of delimiter characters. As such, the function call operator does not return empty tokens. For example, this code extracts all tokens from a string using the function call operator and the default delimiters:

Note the use of the traditional empty token condition to detect the end of tokenization.

The following tokens are extracted by this code:

The nextToken() interface simply returns the next token, which may be empty. This allows search strings to contain empty fields of data. To detect the end of tokenization using this interface, use the done() method on the tokenizer. For example, this code extracts all tokens from a string using the nextToken() method:

The following tokens are extracted by this code:

Note that the comma and semicolon characters act as delimiters, and are specified using an RWUString.

In this case, two empty tokens are extracted by nextToken(). If the function call operator tokenizing interface had been used instead, the empty tokens would not be returned.
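
The difference between the two interfaces can be imitated with the standard library. The sketch below is an illustration of the two behaviors only, not RWUTokenizer code: splitKeepEmpty() mimics an interface that may return empty tokens, and splitSkipEmpty() mimics one that consumes runs of consecutive delimiters. Both function names and the sample input "a,,b;;c" are assumptions made for this example.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Keeps empty fields: every delimiter ends exactly one token,
// so two adjacent delimiters produce an empty token between them.
std::vector<std::string> splitKeepEmpty(const std::string& text,
                                        const std::string& delims) {
    std::vector<std::string> tokens;
    std::size_t start = 0;
    while (true) {
        const std::size_t end = text.find_first_of(delims, start);
        if (end == std::string::npos) {
            tokens.push_back(text.substr(start));  // final token
            break;
        }
        tokens.push_back(text.substr(start, end - start));
        start = end + 1;
    }
    return tokens;
}

// Skips empty fields: a run of consecutive delimiters is consumed
// as a whole, so no empty token is ever produced.
std::vector<std::string> splitSkipEmpty(const std::string& text,
                                        const std::string& delims) {
    std::vector<std::string> tokens;
    std::size_t start = text.find_first_not_of(delims);
    while (start != std::string::npos) {
        const std::size_t end = text.find_first_of(delims, start);
        if (end == std::string::npos) {
            tokens.push_back(text.substr(start));  // final token
            break;
        }
        tokens.push_back(text.substr(start, end - start));
        start = text.find_first_not_of(delims, end);
    }
    return tokens;
}
```

With the assumed input "a,,b;;c" and the delimiters ",;", splitKeepEmpty() returns five tokens, two of them empty, while splitSkipEmpty() returns only "a", "b", and "c".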

The code below illustrates tokenizing a string using a regular expression delimiter and the nextToken() interface:

The following tokens are extracted by this code:

The RWURegularExpression delimiter expression

specifies any number of occurrences of whitespace, followed by either a comma or a semicolon, followed by any number of whitespace characters. (See Section 8.4 for more information on RWURegularExpression.)
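
A delimiter of this shape can be tried out with the standard library's std::regex. The sketch below is not RWURegularExpression code, and the pattern \s*[,;]\s* is only one assumed way to spell the description above in ECMAScript syntax. The function name splitOnRegex() is likewise hypothetical.

```cpp
#include <regex>
#include <string>
#include <vector>

// Splits `text` wherever `pattern` matches, returning the pieces
// between the matches. Uses std::regex, not RWURegularExpression.
std::vector<std::string> splitOnRegex(const std::string& text,
                                      const std::string& pattern) {
    const std::regex delimiter(pattern);
    // Submatch index -1 selects the text between delimiter matches.
    std::sregex_token_iterator first(text.begin(), text.end(), delimiter, -1);
    std::sregex_token_iterator last;
    return std::vector<std::string>(first, last);
}
```

Called with the assumed input "one , two;three ;  four" and the pattern \s*[,;]\s*, the function returns "one", "two", "three", and "four". The whitespace around each comma or semicolon is absorbed into the delimiter match, so it does not appear in the tokens.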