RWUTokenizer

Internationalization Module Reference Guide

Module: Internationalization Module
Group: Unicode String Processing


Does Not Inherit

Header File

#include <rw/i18n/RWUTokenizer.h>

Description

RWUTokenizer finds delimiters in source strings, and provides sequential access to the tokens between those delimiters.

Delimiter characters are a user-defined set of characters used to separate the tokens, or fields, in a string. For example, consider the string:

Token1,Token2,Token3

Using the set of delimiter characters consisting of only a comma, you could break the string into three tokens:

Token1
Token2
Token3

RWUTokenizer provides methods for extracting each token from a string in sequence. The set of delimiters is specified with each token request, and any single code point within the string is a candidate delimiter.
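As a rough illustration of choosing the delimiters on each request, the following standard-library sketch walks a std::u32string with a caller-owned position. It is not RWUTokenizer code: the helper name takeToken and the use of std::u32string in place of RWUString are assumptions made only so the example is self-contained.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not part of the Internationalization Module.
// Returns the token that starts at `pos` and ends before the first code
// point found in `delims`, then advances `pos` past that delimiter.
inline std::u32string takeToken(const std::u32string& text,
                                std::size_t& pos,
                                const std::u32string& delims)
{
    if (pos > text.size()) {
        pos = text.size(); // clamp so substr() cannot throw
    }
    const std::size_t end = text.find_first_of(delims, pos);
    if (end == std::u32string::npos) {
        // No further delimiter: the rest of the string is the token.
        std::u32string token = text.substr(pos);
        pos = text.size();
        return token;
    }
    std::u32string token = text.substr(pos, end - pos);
    pos = end + 1; // step over the single delimiter code point
    return token;
}
```

Calling takeToken three times on Token1,Token2,Token3 with a comma as the only delimiter yields the three tokens shown above. Because the delimiter set is an argument of each call, a string such as name=John;age=33 can be read with "=" on one request and ";" on the next.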

Delimiters can be specified in a variety of ways. If no delimiters are specified, then the next token is extracted using a predefined set of delimiter characters. This set consists of the following: 0x0009 (horizontal tab), 0x000A (line feed), 0x000C (form feed), 0x000D (carriage return), 0x0020 (space), 0x0085 (next line), 0x2028 (line separator), 0x2029 (paragraph separator), and 0x0000 (null).
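The predefined set can be written down as a simple predicate. The function below is a hypothetical helper for illustration only (isDefaultDelimiter is not a member of RWUTokenizer); it encodes exactly the nine code points listed above.

```cpp
// Hypothetical helper, not part of RWUTokenizer: reports whether a code
// point belongs to the predefined delimiter set described in the text.
inline bool isDefaultDelimiter(char32_t cp)
{
    switch (cp) {
    case 0x0009: // horizontal tab
    case 0x000A: // line feed
    case 0x000C: // form feed
    case 0x000D: // carriage return
    case 0x0020: // space
    case 0x0085: // next line
    case 0x2028: // line separator
    case 0x2029: // paragraph separator
    case 0x0000: // null
        return true;
    default:
        return false;
    }
}
```

Note that some code points commonly treated as whitespace, such as 0x000B (vertical tab), are not in the listed set, so this predicate returns false for them.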

Alternatively, you can specify an RWUString, composed of a set of delimiter characters. Each code point in the input RWUString is taken as a possible delimiter character. A slight variation on this technique allows you to specify that only the first N code units in the delimiter string be considered as potential delimiters, in which case the string may have embedded nulls.
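The reason an explicit count permits embedded nulls can be sketched with the standard library. The example assumes UTF-16 code units held in a std::u16string as a stand-in for RWUString; delimiterSet and isDelimiter are hypothetical helper names, not Rogue Wave functions.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: builds a delimiter set from the first `num` code
// units of a buffer. Because the length is explicit, a null code unit
// inside the buffer is kept as an ordinary member of the set.
inline std::u16string delimiterSet(const char16_t* units, std::size_t num)
{
    return std::u16string(units, num);
}

// Hypothetical helper: tests one code unit against the delimiter set.
inline bool isDelimiter(const std::u16string& set, char16_t unit)
{
    return set.find(unit) != std::u16string::npos;
}
```

With a buffer holding a comma, a null, a semicolon, and x, a count of 3 makes the comma, the null, and the semicolon delimiters and leaves x out. Building the set without a count would stop at the null and keep only the comma.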

Finally, you can specify the delimiters as an RWURegularExpression. This technique allows you to specify complex, multi-character delimiters. While the techniques above search only for single-character (code point) delimiters, the regular expression interface can consume a single delimiter that consists of several code points.

Two variations on the interface are provided. The first uses the function call operator, operator()(). In the tradition of RWCTokenizer, this interface scans a string for all occurrences of tokens, consuming all consecutive occurrences of a delimiter. As a result, the function call operator does not return empty tokens.

The second variation is provided through a set of overloads of the nextToken() method. This version of the interface returns the next token, which may be empty, so search strings can contain empty fields of data. To detect the end of tokenization using this interface, call the done() method on the tokenizer. When using the function call interface, you can detect the end of tokenization with either the done() method or the traditional empty-token condition.
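The difference between the two variations can be sketched with a small standard-library class. SketchTokenizer is a hypothetical stand-in written for this illustration, not the RWUTokenizer implementation; it works on UTF-32 code points in a std::u32string, and its treatment of edge cases (for example an empty search string) is an assumption.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for the two interfaces; not RWUTokenizer.
class SketchTokenizer
{
public:
    explicit SketchTokenizer(const std::u32string& text)
        : text_(text), pos_(0), done_(text.empty()) {}

    // nextToken() style: returns the field up to the next delimiter.
    // Consecutive delimiters therefore produce an empty token.
    std::u32string nextToken(const std::u32string& delims)
    {
        if (done_) return std::u32string();
        const std::size_t end = text_.find_first_of(delims, pos_);
        if (end == std::u32string::npos) {
            done_ = true; // last field extracted
            return text_.substr(pos_);
        }
        std::u32string token = text_.substr(pos_, end - pos_);
        pos_ = end + 1;
        return token;
    }

    // operator()() style: skips runs of delimiters, so an empty return
    // value only ever means that no tokens remain.
    std::u32string operator()(const std::u32string& delims)
    {
        if (done_) return std::u32string();
        const std::size_t begin = text_.find_first_not_of(delims, pos_);
        if (begin == std::u32string::npos) {
            done_ = true;
            return std::u32string();
        }
        const std::size_t end = text_.find_first_of(delims, begin);
        std::u32string token = (end == std::u32string::npos)
            ? text_.substr(begin)
            : text_.substr(begin, end - begin);
        // Consume trailing delimiters so done() becomes true as soon as
        // the last non-empty token has been returned.
        pos_ = (end == std::u32string::npos)
            ? std::u32string::npos
            : text_.find_first_not_of(delims, end);
        if (pos_ == std::u32string::npos) {
            pos_ = text_.size();
            done_ = true;
        }
        return token;
    }

    bool done() const { return done_; }

private:
    std::u32string text_; // owned copy of the search string
    std::size_t pos_;     // index of the next unread code point
    bool done_;           // true once the last token has been returned
};
```

For the search string a,,b with a comma delimiter, the function call form in this sketch returns a and then b, while the nextToken() form returns a, an empty token, and then b.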

Example

#include <rw/i18n/RWUTokenizer.h>
#include <rw/i18n/RWUConversionContext.h>
#include <rw/i18n/RWURegularExpression.h>
#include <iostream>

using std::cout;
using std::endl;

int
main()
{
  // Create a conversion context to convert between
  // US-ASCII and Unicode
  RWUConversionContext ascii("US-ASCII");

  // Create a search string
  RWUString text("John, Doe; 33,175;  ; Anchorage, AK");

  // Delimit fields with a ',' or a ';' followed by one or more
  // whitespace characters.
  RWURegularExpression delim("[,;][{Zs}]+");

  // Create a tokenizer and a string in which to receive tokens
  RWUTokenizer tknzr(text);
  RWUString    token;

  // Extract tokens using the function call operator
  // interface.  Note that empty tokens are *not* returned.
  cout << "Using function call operator:" << endl;
  for (token = tknzr(delim); !token.isNull(); token = tknzr(delim)) {
    cout << "  <" << token << ">" << endl;
  } // for

  // Reset the tokenizer.
  tknzr.setText(text);

  // Extract tokens again, using the nextToken() interface.  
  // Note that consecutive delimiters will cause nextToken()
  // to return an empty token.
  cout << "\nUsing nextToken():" << endl;
  while (!tknzr.done()) {
    token = tknzr.nextToken(delim);
    cout << "  <" << token << ">" << endl;
  } // while

  return 0;
} // main

Results:
========

Using function call operator:
  <John>
  <Doe>
  <33,175>
  <Anchorage>
  <AK>

Using nextToken():
  <John>
  <Doe>
  <33,175>
  <>
  <Anchorage>
  <AK>

Public Constructors

RWUTokenizer();

Default constructor. Constructs an empty RWUTokenizer with no string to be tokenized. No tokens can be obtained from such a tokenizer until the setText() method is used to assign a string to the tokenizer.

RWUTokenizer(const RWUString& text);

Constructs an RWUTokenizer with string text to be tokenized.

RWUTokenizer(const RWUTokenizer& source);

Copy constructor. Initializes an RWUTokenizer as a deep copy of source. The new tokenizer begins tokenizing from the location in the search string where the source tokenizer left off. Tokenizations within either tokenizer do not affect the state of the other.
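The copy semantics described here (the copy resumes where the source left off, and the two then advance independently) can be pictured with a value type that owns both the text and the current position. CursorSketch is a hypothetical struct for illustration only and makes no claim about how RWUTokenizer stores its state.

```cpp
#include <cstddef>
#include <string>

// Hypothetical illustration of deep-copy tokenizer state.
struct CursorSketch
{
    std::u32string text; // owned copy of the search string
    std::size_t pos;     // index of the next unread code point

    // Returns the next field delimited by `delim`, or an empty string
    // once the text is exhausted.
    std::u32string next(char32_t delim)
    {
        if (pos >= text.size()) return std::u32string();
        std::size_t end = text.find(delim, pos);
        if (end == std::u32string::npos) end = text.size();
        std::u32string token = text.substr(pos, end - pos);
        pos = (end < text.size()) ? end + 1 : end;
        return token;
    }
};
```

If one token is read from a CursorSketch holding a,b,c and the object is then copied, both objects return b on their next request, and reading further from the copy does not move the original.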

Public Destructor

~RWUTokenizer();

Destructor.

Public Member Operators

RWUTokenizer&
operator=(const RWUTokenizer& rhs);

Assignment operator. Makes self a deep copy of rhs. Self then begins tokenizing from the location in the search string where rhs left off. Tokenizations within either tokenizer do not affect the state of the other. Returns a reference to self.

RWUConstSubString
operator()();

Returns the next token, using the default set of delimiter characters: 0x0009 (horizontal tab), 0x000A (line feed), 0x000C (form feed), 0x000D (carriage return), 0x0020 (space), 0x0085 (next line), 0x2028 (line separator), 0x2029 (paragraph separator), and 0x0000 (null).

This method consumes consecutive occurrences of any delimiter code point, skipping over any empty fields that may be present in the string. To obtain empty fields as well as non-empty fields, use the nextToken() method.

RWUConstSubString
operator()(const RWUString& str);

Returns the next token, using the code points in the specified string str as the set of delimiter characters.

RWUConstSubString
operator()(const RWUString& str, size_t num);

Returns the next token, using the first num code units from the input string str as the set of delimiter characters.

RWUConstSubString
operator()(RWURegularExpression& regex);

Returns the next token, using a delimiter pattern represented by the regular expression pattern regex.

Unlike the other operator() overloads, this method allows a single occurrence of a delimiter to span multiple characters. For example, consider the RWUTokenizer instance tok. The statement tok(RWUString("ab")) treats either a or b as a delimiter character, but tok(RWURegularExpression("ab")) treats the two-character pattern ab as a single delimiter.

Public Member Functions

bool
done() const;

Returns true if the last token from the search string has been extracted; otherwise, false. When using the function call operator interface, this equates to the last non-empty token having been returned.

RWUString
getText() const;

Returns a copy of the string associated with self.

RWUConstSubString
nextToken();

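Returns the next token, using the default set of delimiter characters: 0x0009 (horizontal tab), 0x000A (line feed), 0x000C (form feed), 0x000D (carriage return), 0x0020 (space), 0x0085 (next line), 0x2028 (line separator), 0x2029 (paragraph separator), and 0x0000 (null).

This method may return an empty token if there are consecutive occurrences of any delimiter code point in the search string.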

RWUConstSubString
nextToken(const RWUString& str);

Returns the next token, using the specified string str of delimiter code points.

This method may return an empty token if there are consecutive occurrences of any delimiter character in the search string.

RWUConstSubString
nextToken(const RWUString& str, size_t num);

Returns the next token, using the first num code units from the given string str as the set of delimiter code points.

This method may return an empty token if there are consecutive occurrences of any delimiter character in the search string.

RWUConstSubString
nextToken(RWURegularExpression& regex);

Returns the next token, using a delimiter pattern represented by the regular expression regex.

Unlike the other nextToken() overloads, this method allows a single occurrence of a delimiter to span multiple characters. For example, nextToken(RWUString("ab")) treats either a or b as a delimiter character, but nextToken(RWURegularExpression("ab")) treats the two-character pattern ab as a single delimiter.

This method may return an empty token if there are consecutive occurrences of the delimiter pattern in the search string.
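The contrast between a set of single-character delimiters and a multi-character delimiter pattern can be reproduced with std::regex. This is only an analogy under stated assumptions: std::regex matches char elements rather than Unicode code points, its pattern syntax is ECMAScript rather than that of RWURegularExpression, and splitByPattern is a hypothetical helper name.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: splits `text` on every match of `pattern` and
// keeps empty fields, similar in spirit to the nextToken() interface.
inline std::vector<std::string> splitByPattern(const std::string& text,
                                               const std::string& pattern)
{
    const std::regex delim(pattern);
    // Submatch index -1 selects the text between matches.
    std::sregex_token_iterator first(text.begin(), text.end(), delim, -1);
    std::sregex_token_iterator last;
    return std::vector<std::string>(first, last);
}
```

With the input 1ab2a3, the character class [ab] treats a and b as separate delimiters and produces 1, an empty field, 2, and 3, whereas the pattern ab matches only the two-character sequence and produces 1 and 2a3. Two consecutive pattern matches, as in 1abab2 split on ab, leave an empty field between them.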

void
setText(const RWUString& text);

Sets the string to be tokenized by self to text. The starting position is set to the beginning of the string. A deep copy of the text string is stored within the tokenizer.

© Copyright Rogue Wave Software, Inc. All Rights Reserved.
Rogue Wave and SourcePro are registered trademarks of Rogue Wave Software, Inc. in the United States and other countries. All other trademarks are the property of their respective owners.