Chapter 7 Boundary Analysis and Tokenizing

Overview

The Internationalization Module contains two classes for finding boundaries and delimiters in Unicode strings:

• RWUBreakSearch finds the locations of breaks in text. This class correctly interprets whitespace and punctuation based on a specific locale.

• RWUTokenizer finds delimiters and returns, one at a time, the tokens between them. By default, RWUTokenizer uses a predefined set of whitespace characters as delimiters. Alternatively, it can use a caller-specified set of characters or a regular expression. A regular expression permits complex, multicharacter delimiters.
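The three delimiter styles described for RWUTokenizer (default whitespace, an arbitrary character set, and a regular expression) can be illustrated with a minimal sketch that uses only the C++ standard library. This is not the RWUTokenizer interface: the helper names splitOnChars and splitOnRegex are hypothetical, the default whitespace set shown is an assumption, and the code works on byte-oriented std::string values rather than Unicode strings.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper, not part of the Internationalization Module.
// Splits `text` at any character contained in `delims` and skips empty
// tokens. The default delimiter set is an assumed set of ASCII whitespace.
std::vector<std::string> splitOnChars(const std::string& text,
                                      const std::string& delims = " \t\r\n")
{
    std::vector<std::string> tokens;
    std::string::size_type start = text.find_first_not_of(delims);
    while (start != std::string::npos) {
        // `end` is npos when the last token runs to the end of the text.
        std::string::size_type end = text.find_first_of(delims, start);
        tokens.push_back(text.substr(start, end - start));
        start = text.find_first_not_of(delims, end);
    }
    return tokens;
}

// Hypothetical helper, not part of the Internationalization Module.
// Splits `text` wherever `delim` matches, so a delimiter may span
// several characters. Empty tokens are skipped.
std::vector<std::string> splitOnRegex(const std::string& text,
                                      const std::regex& delim)
{
    std::vector<std::string> tokens;
    // Submatch index -1 selects the text between matches of `delim`.
    std::sregex_token_iterator it(text.begin(), text.end(), delim, -1);
    std::sregex_token_iterator last;
    for (; it != last; ++it) {
        if (it->length() > 0) {
            tokens.push_back(it->str());
        }
    }
    return tokens;
}
```

In this sketch, splitOnChars("a,b;c", ",;") treats each listed character as a separate delimiter, while splitOnRegex with the pattern "\\s*[,;]\\s*" treats a comma or semicolon together with any surrounding whitespace as a single delimiter.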

This chapter describes how to use these classes.