Rogue Wave banner
Previous fileTop of DocumentContentsIndex pageNext file
Internationalization Module Reference Guide
Rogue Wave web site:  Home Page  |  Main Documentation Page

RWUTokenizer

Module:  Internationalization Module   Group:  Unicode String Processing


Does Not Inherit

Local Index

Members

Header File

#include <rw/i18n/RWUTokenizer.h> 

Description

RWUTokenizer finds delimiters in source strings, and provides sequential access to the tokens between those delimiters.

Delimiter characters are a user-defined set of characters used to separate the tokens, or fields, in a string. For example, consider the string:

Using the set of delimiter characters consisting of only a comma, you could break the string into three tokens:

RWUTokenizer provides methods for extracting in sequence each token from a string, while specifying a set of delimiters with each token request. Any single code point within the string is a candidate delimiter.

Delimiters can be specified in a variety of ways. If no delimiters are specified, then the next token is extracted using a predefined set of delimiter characters. This set consists of the following: 0x0009 (horizontal tab), 0x000A (line feed), 0x000C (form feed), 0x000D (carriage return), 0x0020 (space), 0x0085 (next line), 0x2028 (line separator), 0x2029 (paragraph separator), and 0x0000 (null).

Alternatively, you can specify an RWUString, composed of a set of delimiter characters. Each code point in the input RWUString is taken as a possible delimiter character. A slight variation on this technique allows you to specify that only the first N code units in the delimiter string be considered as potential delimiters, in which case the string may have embedded nulls.

Finally, you can specify the delimiters as an RWURegularExpression. This technique allows for the specification of complex, multi-character delimiters. While the above techniques search for only single character (code point) delimiters, the regular expression interface could consume a single delimiter consisting of a number of code points.

Two variations on the interface are provided. The first is provided using the function call operator()(). In the tradition of RWCTokenizer, this interface scans a string for all occurrences of tokens, consuming all consecutive occurrences of a delimiter. As such, the function call operator does not return empty tokens.

The second variation on the interface is provided through a set of overloads of the nextToken() method. This version of the interface returns the next token, which may be empty. This allows search strings to contain empty fields of data. To detect the end of tokenization using this interface, use the done() method on the tokenizer. When using the function call interface, either the done() method, or the traditional empty token condition can be used to detect the end of tokenization.

Example

Public Constructors

RWUTokenizer();
RWUTokenizer(const RWUString& text);
RWUTokenizer(const RWUTokenizer& source);

Public Destructor

~RWUTokenizer();

Public Member Operators

RWUTokenizer&
operator=(const RWUTokenizer& rhs);
RWUConstSubString
operator()();
RWUConstSubString
operator()(const RWUString& str);
RWUConstSubString
operator()(const RWUString& str, size_t num);
RWUConstSubString
operator()(RWURegularExpression& regex);

Public Member Functions

bool
done() const;
RWUString
getText() const;
RWUConstSubString
nextToken();
RWUConstSubString
nextToken(const RWUString& str);
RWUConstSubString
nextToken(const RWUString& str, size_t num);
RWUConstSubString
nextToken(RWURegularExpression& regex);
void
setText(const RWUString& text);


Previous fileTop of DocumentContentsIndex pageNext file

© Copyright Rogue Wave Software, Inc. All Rights Reserved.
Rogue Wave and SourcePro are registered trademarks of Rogue Wave Software, Inc. in the United States and other countries. All other trademarks are the property of their respective owners.
Contact Rogue Wave about documentation or support issues.