Chapter 7 Boundary Analysis and Tokenizing
Overview
The Internationalization Module contains two classes for finding delimiters in Unicode strings:
RWUBreakSearch finds the locations of breaks in text. This class interprets whitespace and punctuation according to the conventions of a specified locale.
RWUTokenizer finds delimiters and sequentially returns the tokens between them. By default, RWUTokenizer treats a predefined set of whitespace characters as delimiters. Alternatively, it can use a caller-specified set of arbitrary characters or a regular expression. A regular expression permits complex, multicharacter delimiters.
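The difference between whitespace delimiters and regular-expression delimiters can be illustrated with a minimal sketch in standard C++. This is not the RWUTokenizer interface: the helper name splitOnDelimiters is hypothetical, it operates on plain std::string rather than Unicode strings, and dropping empty tokens is a choice made for this sketch, not documented RWUTokenizer behavior.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper, not part of the Internationalization Module.
// Returns the tokens found between matches of delimiterPattern in text.
std::vector<std::string> splitOnDelimiters(const std::string& text,
                                           const std::string& delimiterPattern)
{
    // An empty pattern would match at every position, so reject it.
    assert(!delimiterPattern.empty());

    const std::regex delimiter(delimiterPattern);
    std::vector<std::string> tokens;

    // Submatch index -1 selects the text between delimiter matches.
    std::sregex_token_iterator it(text.begin(), text.end(), delimiter, -1);
    const std::sregex_token_iterator end;
    for (; it != end; ++it) {
        // Assumption of this sketch: empty tokens (for example, from a
        // leading delimiter) are skipped.
        if (it->length() > 0) {
            tokens.push_back(it->str());
        }
    }
    return tokens;
}
```

Under these assumptions, splitOnDelimiters("alpha beta\tgamma", "\\s+") yields the three tokens alpha, beta, and gamma, while splitOnDelimiters("a<->b<->c", "<->") yields a, b, and c. The second call uses a multicharacter delimiter, which a simple set of single delimiter characters cannot express.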
This chapter describes how to use these classes.