Extracting Tokens

RWUTokenizer provides two variations on the tokenizing interface:

• the function call operator, operator()()

• the nextToken() method

Each interface has overloads that allow you to tokenize using the default delimiters, a set of delimiters specified as an RWUString, a set of delimiters specified as the first N code units of an RWUString, or a set of delimiters specified as an RWURegularExpression (see “Specifying Delimiters”).

Using the Function Call Operator

In the tradition of RWCTokenizer, the function call operator scans the string for tokens and consumes every run of consecutive delimiter characters as a single separator. As a result, it never returns an empty token. For example, this code extracts all tokens from a string using the function call operator and the default delimiters:

RWUConversionContext ascii("ascii");
RWUString text("This is a string.");
RWUString next;
RWUTokenizer tok(text);
for (next = tok(); !next.isNull(); next = tok()) {
    // Process each token
}

Note the use of the traditional empty token condition to detect the end of tokenization.

The following tokens are extracted by this code:

This
is
a
string.
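The delimiter-collapsing behavior described above can be sketched in standard C++. This is only an illustration on std::string, not the RWUTokenizer implementation: the helper name splitSkippingEmpty is hypothetical, and the default delimiter set (space, tab, newline) is an assumed ASCII stand-in for the module's default delimiters.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper, not part of the Internationalization Module.
// Splits `text` on any character in `delims`, treating a run of
// consecutive delimiters as one separator, so no empty token is produced.
std::vector<std::string> splitSkippingEmpty(const std::string& text,
                                            const std::string& delims = " \t\n")
{
    assert(!delims.empty());  // an empty delimiter set is not meaningful here
    std::vector<std::string> tokens;
    // Skip any leading delimiters to find the start of the first token.
    std::string::size_type start = text.find_first_not_of(delims);
    while (start != std::string::npos) {
        // The token runs up to the next delimiter or the end of the text.
        std::string::size_type end = text.find_first_of(delims, start);
        tokens.push_back(text.substr(start, end - start));
        // Skip the whole run of delimiters that follows the token.
        start = text.find_first_not_of(delims, end);
    }
    return tokens;
}
```

Calling splitSkippingEmpty("This is a string.") yields the four tokens listed above. Calling it with the delimiters ",;" on a string that contains adjacent delimiters yields only the non-empty fields.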

Using the nextToken() Method

The nextToken() interface simply returns the next token, which may be empty. This allows search strings to contain empty fields of data. To detect the end of tokenization using this interface, use the done() method on the tokenizer. For example, this code extracts all tokens from a string using the nextToken() method:

RWUConversionContext ascii("ascii");
RWUString text("John,Doe;,,33,175;");
RWUString delimiters(",;");
RWUString next;
RWUTokenizer tok(text);
while (!tok.done()) {
    next = tok.nextToken(delimiters);
    // Process the token
}

The following tokens are extracted by this code:

John
Doe
(empty token)
(empty token)
33
175

Note that the comma and semicolon characters act as delimiters, and are specified using an RWUString.

In this case, two empty tokens are extracted by nextToken(). If the function call operator tokenizing interface had been used instead, the empty tokens would not be returned.
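To make the contrast concrete, here is a minimal sketch of the nextToken()/done() pattern in standard C++. It is a stand-in on std::string and not the RWUTokenizer implementation: the class FieldTokenizer and the helper allFields are hypothetical names, and the handling of a trailing delimiter is an assumption chosen to match the two empty tokens described above.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for the nextToken()/done() pattern. It is not
// RWUTokenizer and works on plain std::string characters only.
class FieldTokenizer {
public:
    explicit FieldTokenizer(const std::string& text) : text_(text), pos_(0) {}

    // True once every character of the text has been consumed.
    bool done() const { return pos_ >= text_.size(); }

    // Returns the characters up to the next delimiter (possibly none),
    // then steps past that single delimiter only.
    std::string nextToken(const std::string& delims)
    {
        assert(!done());
        std::string::size_type end = text_.find_first_of(delims, pos_);
        if (end == std::string::npos) end = text_.size();
        std::string token = text_.substr(pos_, end - pos_);
        pos_ = end + 1;
        return token;
    }

private:
    std::string text_;
    std::string::size_type pos_;
};

// Collects every field, including empty ones, using the done() loop.
std::vector<std::string> allFields(const std::string& text,
                                   const std::string& delims)
{
    FieldTokenizer tok(text);
    std::vector<std::string> fields;
    while (!tok.done()) fields.push_back(tok.nextToken(delims));
    return fields;
}
```

For the text "John,Doe;,,33,175;" and the delimiters ",;", allFields returns six fields, of which the third and fourth are empty. Because each call consumes exactly one delimiter, adjacent delimiters produce empty fields instead of being collapsed.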

The code below illustrates tokenizing a string using a regular expression delimiter and the nextToken() interface:

RWUConversionContext ascii("ascii");
RWUString text("John, Doe, 33,175;");
RWURegularExpression delimiters(RWCString("[{Zs}]*[,;][{Zs}]*"));
RWUString next;
RWUTokenizer tok(text);
while (!tok.done()) {
    next = tok.nextToken(delimiters);
    // Process the token
}

The following tokens are extracted by this code:

John
Doe
33
175

The RWURegularExpression delimiter expression

RWURegularExpression delimiters(RWCString("[{Zs}]*[,;][{Zs}]*"));

specifies zero or more space separator characters (Unicode general category Zs), followed by either a comma or a semicolon, followed by zero or more space separator characters. (See “Regular Expression String Searching” for more information on RWURegularExpression.)
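The same idea of splitting on a delimiter pattern can be tried out with the standard library. The sketch below uses std::regex, not RWURegularExpression, and the helper name splitOnPattern is hypothetical. Since std::regex cannot express the Unicode Zs category, the default pattern uses a plain space as an assumed ASCII-only substitute.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper built on std::regex, not RWURegularExpression.
// The default pattern means: optional spaces, a comma or semicolon,
// optional spaces. A plain space stands in for the Zs category.
std::vector<std::string> splitOnPattern(const std::string& text,
                                        const std::string& pattern = " *[,;] *")
{
    assert(!pattern.empty());
    const std::regex delimiter(pattern);
    // Submatch index -1 selects the text between matches of the delimiter.
    std::sregex_token_iterator it(text.begin(), text.end(), delimiter, -1);
    const std::sregex_token_iterator end;
    std::vector<std::string> tokens;
    for (; it != end; ++it) tokens.push_back(it->str());
    return tokens;
}
```

For the text "John, Doe, 33,175;" this returns John, Doe, 33, and 175, with the spaces around each delimiter absorbed into the delimiter match. Empty-field behavior may differ from the nextToken() interface, so treat this only as a way to experiment with the delimiter pattern itself.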