Internationalization Module User's Guide

7.3 Tokenizing

RWUTokenizer finds delimiters in source strings, and provides sequential access to the tokens between those delimiters.

7.3.1 Creating an RWUTokenizer

An RWUTokenizer stores a deep copy of a string to be tokenized. You can pass the string in the constructor, as shown below:

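// Conversion context used when converting narrow strings to RWUString
RWUConversionContext ascii("ascii");
RWUString text("This is a string.");
RWUTokenizer tok(text);   // tok stores its own copy of text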

The setText() method assigns a new string to a tokenizer, like so:

RWUString text2("This is another string.");
tok.setText(text2);

The getText() method returns a copy of the current string associated with a tokenizer.

You can also construct an empty RWUTokenizer, but it yields no tokens until setText() assigns a string to it. For example:

RWUTokenizer tok2;   // no parentheses: "tok2()" would declare a function
tok2.setText(text);

7.3.2 Specifying Delimiters

Delimiter characters are a user-defined set of characters used to separate the tokens, or fields, in a string. Consider the following string:

Token1,Token2,Token3

Using the set of delimiter characters consisting of only a comma, you could break the string into the following three tokens:

Token1
Token2
Token3

RWUTokenizer provides methods for extracting each token from a string in sequence. You specify the set of delimiters with each token request.

Delimiters can be specified in a variety of ways. If no delimiters are specified, then the next token is extracted using a predefined set of delimiter characters. This set consists of the following: 0x0009 (horizontal tab), 0x000A (line feed), 0x000C (form feed), 0x000D (carriage return), 0x0020 (space), 0x0085 (next line), 0x2028 (line separator), 0x2029 (paragraph separator), and 0x0000 (null).

Alternatively, you can specify an RWUString, composed of a set of delimiter characters. Each code point in the delimiter argument is taken as a possible delimiter character. A slight variation on this technique allows you to specify that only the first N code units in the delimiter argument are considered as potential delimiters.

Finally, you can specify the delimiter argument as an RWURegularExpression. This technique allows you to specify complex, multicharacter delimiters. While the techniques above search only for single-character (single code point) delimiters, the regular expression interface can match a single delimiter that spans several code points.

Note that the static method RWUCharTraits::getWhitespace() returns a null-terminated array of whitespace code points, as a convenience for use as delimiters.
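The single-character delimiter options described above can be modeled in standard C++. The following is a minimal sketch, not RWUTokenizer code: the names kDefaultDelimiters, isDefaultDelimiter(), and isDelimiter() are hypothetical, and the sketch works on UTF-16 code units in a std::u16string, so it assumes every delimiter lies in the Basic Multilingual Plane.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <string>

// The predefined delimiter code points listed in this section,
// stored as UTF-16 code units (hypothetical name).
constexpr std::array<char16_t, 9> kDefaultDelimiters = {
    0x0009, 0x000A, 0x000C, 0x000D, 0x0020,
    0x0085, 0x2028, 0x2029, 0x0000
};

// True if c is one of the predefined delimiters.
bool isDefaultDelimiter(char16_t c) {
    return std::find(kDefaultDelimiters.begin(), kDefaultDelimiters.end(), c)
           != kDefaultDelimiters.end();
}

// True if c appears among the first n code units of delims.
// The default value of n considers the whole delimiter string.
bool isDelimiter(char16_t c, const std::u16string& delims,
                 std::size_t n = std::u16string::npos) {
    const auto limit =
        static_cast<std::ptrdiff_t>(std::min(n, delims.size()));
    const auto last = delims.begin() + limit;
    return std::find(delims.begin(), last, c) != last;
}
```

With the delimiter string ",;", isDelimiter(u';', u",;") is true, while isDelimiter(u';', u",;", 1) is false because only the comma is considered.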

7.3.3 Extracting Tokens

RWUTokenizer provides two variations on the tokenizing interface:

the function call operator, operator()()
the nextToken() method

Each interface has overloads that allow you to tokenize using the default delimiters, a set of delimiters specified as an RWUString, a set of delimiters specified as the first N code units of an RWUString, or a set of delimiters specified as an RWURegularExpression (see Section 7.3.2).

7.3.3.1 Using the Function Call Operator

In the tradition of RWCTokenizer, the function call operator interface scans a string for tokens, consuming all consecutive occurrences of the specified delimiter characters. As a result, the function call operator never returns empty tokens. For example, this code extracts all tokens from a string using the function call operator and the default delimiters:

RWUConversionContext ascii("ascii");

RWUString text("This is a string.");
RWUString next;
RWUTokenizer tok(text);

for (next = tok(); !next.isNull(); next = tok()) {
   // Process each token
}

Note that the loop detects the end of tokenization in the traditional way: it stops when the function call operator returns a null token.

The following tokens are extracted by this code:

This
is
a
string.
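The skip-empty behavior described above can be illustrated without the library. The sketch below is a model built on std::string, not the RWUTokenizer implementation; the function name splitSkippingEmpty() is hypothetical, and the delimiter set is passed explicitly rather than defaulted.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Collects tokens the way the function call operator is described:
// each run of consecutive delimiters is consumed as a whole, so the
// result never contains an empty token.
std::vector<std::string> splitSkippingEmpty(const std::string& text,
                                            const std::string& delims) {
    std::vector<std::string> tokens;
    // Skip any leading delimiters.
    std::size_t pos = text.find_first_not_of(delims);
    while (pos != std::string::npos) {
        // The token ends at the next delimiter, or at the end of the text.
        const std::size_t end = text.find_first_of(delims, pos);
        tokens.push_back(text.substr(pos, end - pos));
        // Skip the whole run of delimiters that follows the token.
        pos = text.find_first_not_of(delims, end);
    }
    return tokens;
}
```

Calling splitSkippingEmpty("This is a string.", " \t\n") yields the same four tokens listed above.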

7.3.3.2 Using the nextToken() Method

The nextToken() interface simply returns the next token, which may be empty. This allows source strings to contain empty fields of data. To detect the end of tokenization using this interface, use the done() method on the tokenizer. For example, this code extracts all tokens from a string using the nextToken() method:

RWUConversionContext ascii("ascii");

RWUString text("John,Doe;,,33,175;");
RWUString delimiters(",;");
RWUString next;
RWUTokenizer tok(text);

while (!tok.done()) {
   next = tok.nextToken(delimiters);
   // Process the token
}

The following tokens are extracted by this code:

John
Doe
(empty)
(empty)
33
175

Note that the comma and semicolon characters act as delimiters, and are specified using an RWUString.

In this case, two empty tokens are extracted by nextToken(). If the function call operator tokenizing interface had been used instead, the empty tokens would not be returned.

The code below illustrates tokenizing a string using a regular expression delimiter and the nextToken() interface:

RWUConversionContext ascii("ascii");
RWUString text("John,     Doe,       33,175;");
RWURegularExpression delimiters(RWCString("[{Zs}]*[,;][{Zs}]*"));
RWUString    next;
RWUTokenizer tok(text);

while (!tok.done()) {
   next = tok.nextToken(delimiters);
   // Process the token
}

The following tokens are extracted by this code:

John
Doe
33
175

The RWURegularExpression delimiter expression

RWURegularExpression delimiters(RWCString("[{Zs}]*[,;][{Zs}]*"));

specifies zero or more whitespace characters, followed by either a comma or a semicolon, followed by zero or more whitespace characters. The whole match is consumed as a single delimiter. (See Section 8.4 for more information on RWURegularExpression.)
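For comparison, the same multicharacter-delimiter idea can be sketched with the standard library's std::regex. This is an analogy rather than the RWURegularExpression API: the function name splitOnRegex() is hypothetical, and the pattern uses the ECMAScript class \s in place of {Zs}, so it matches a somewhat different set of whitespace characters.

```cpp
#include <regex>
#include <string>
#include <vector>

// Splits text on every match of delimiter and returns the pieces
// between the matches.
std::vector<std::string> splitOnRegex(const std::string& text,
                                      const std::regex& delimiter) {
    // Submatch index -1 selects the text between matches instead of
    // the matches themselves.
    std::sregex_token_iterator first(text.begin(), text.end(), delimiter, -1);
    std::sregex_token_iterator last;
    return std::vector<std::string>(first, last);
}
```

With the delimiter std::regex("\\s*[,;]\\s*") and the text "John,     Doe,       33,175;", the function returns John, Doe, 33, and 175. The spaces after each comma are absorbed into the delimiter match, so they do not appear in the tokens.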

The Rogue Wave name and logo, and SourcePro, are registered trademarks of Rogue Wave Software. All other trademarks are the property of their respective owners.