Boundary Analysis

RWUBreakSearch finds the locations of code point, character, word, sentence, and line breaks in text.

• Code point breaks occur before and after each code point.

• Character breaks occur between characters, as defined from the end user's perspective. For example, an accented character can be represented by a single code point or by a pair of code points (one for the base character and another for the accent symbol). Character breaks occur on either side of this logical character, regardless of the number of code points used to represent it. An RWUBreakSearch that searches for character breaks can be used to iterate over the logical characters in a string.

• Word breaks occur before and after each word. They do not occur before and after punctuation contained within a word—such as a hyphen or an apostrophe—but they do occur before and after characters that are not part of a word, such as symbols and other punctuation marks. Note that, in some languages, words are not necessarily surrounded by whitespace. An RWUBreakSearch that searches for word breaks is useful in the creation of operations that find whole words.

• Sentence breaks occur between sentences. RWUBreakSearch attempts to correctly interpret nested quotes, nested parentheses, and periods that may either end a sentence or belong to a number or abbreviation. This is a difficult task, however, and the results are not always perfect. An RWUBreakSearch that searches for sentence breaks could be used to count the sentences in a string.

• Line breaks occur at positions where it would be appropriate to wrap text from one display line to the next. An RWUBreakSearch that searches for line breaks is useful in the creation of line-wrapping algorithms.
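The difference between code point breaks and character breaks can be illustrated without the library. The following is a minimal standalone sketch, not the algorithm RWUBreakSearch uses: it assumes that only code points in the range U+0300 to U+036F (Combining Diacritical Marks) attach to the preceding code point, which is far narrower than real locale-sensitive rules. The function names isCombiningMark, codePointBreaks, and characterBreaks are hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Simplifying assumption: only U+0300..U+036F (Combining Diacritical Marks)
// attach to the preceding code point. Real character-break rules cover more.
bool isCombiningMark(char32_t cp) {
    return cp >= 0x0300 && cp <= 0x036F;
}

// Offsets of code point breaks: one before and one after every code point.
std::vector<std::size_t> codePointBreaks(const std::u32string& text) {
    std::vector<std::size_t> breaks;
    for (std::size_t i = 0; i <= text.size(); ++i) {
        breaks.push_back(i);
    }
    return breaks;
}

// Offsets of character breaks: skip any offset that would separate a base
// code point from a combining mark that follows it.
std::vector<std::size_t> characterBreaks(const std::u32string& text) {
    std::vector<std::size_t> breaks;
    for (std::size_t i = 0; i <= text.size(); ++i) {
        bool splitsCharacter =
            i > 0 && i < text.size() && isCombiningMark(text[i]);
        if (!splitsCharacter) {
            breaks.push_back(i);
        }
    }
    return breaks;
}
```

For the three code points e, U+0301 (combining acute accent), and a, this sketch reports four code point breaks but only three character breaks, because the offset between e and its accent is not a character break.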

Instances of RWUBreakSearch are used by other classes in the Internationalization Module to find breaks in text in a locale-sensitive manner. For example, RWUStringSearch performs flexible, collation-based string searches, using the rules encapsulated by an RWUCollator and an optional RWUBreakSearch to determine if and where a match occurs (“Locale-Sensitive String Searching”). Similarly, RWURegularExpression uses an RWUBreakSearch internally to find break-related matches (“Regular Expression String Searching”).

Creating an RWUBreakSearch

RWUBreakSearch objects are created given:

• the type of boundary to be analyzed. The BreakType enum specifies the type of boundary. The enumerated values are CodePoint, Line, Sentence, Word, and Character.

• an RWUString that provides text for processing

• (optional) a locale name. If no locale is specified, then the current default locale is used.

For example, this code creates an RWUBreakSearch that can be used to search the RWUString myString for character breaks based on the current default locale:

RWUBreakSearch searcher(RWUBreakSearch::Character, myString);

Using an RWUBreakSearch

Once a break search is instantiated, breaks can be queried using first(), last(), next(), and previous() methods. An RWUBreakSearch object maintains a current position. Initially, the current position is the start of the source string. Calls to first(), last(), next(), and previous() alter the current position.

NOTE: Breaks are interpreted as being between characters, immediately to the left of the current position.

For example, the following code counts the number of sentences in a string:

RWUConversionContext context("UTF-8");                     //1
RWUString str("Unicode 3.2 is a minor version of the "      //2
              "Unicode Standard. It overrides certain features of "
              "Unicode 3.1, and adds a significant number of coded "
              "characters.");
RWUBreakSearch searcher(RWUBreakSearch::Sentence, str);     //3
RWUConstStringIterator iter = str.beginCodePointIterator(); //4
RWUConstStringIterator end = str.endCodePointIterator();    //5
int count = 0;
while (iter != end) {
    ++count;                                                //6
    iter = searcher.next();
} // while
std::cout << "Found " << count << " sentences." << std::endl;

//1 Indicates that source and target strings are encoded as UTF-8.

//2 Initializes a Unicode string.

//3 Creates an RWUBreakSearch capable of finding sentence breaks, based on the default locale.

//4 Finds the beginning of the first sentence.

//5 Finds the end of the last sentence.

//6 Counts the sentences in the string.
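To show why periods inside numbers and abbreviations make sentence breaking hard, here is a deliberately crude, ASCII-only sketch that does not use the library. It assumes a sentence ends at '.', '!' or '?' followed by whitespace or the end of the text. The function countSentencesNaive is hypothetical and is not how RWUBreakSearch works.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper with a crude ASCII-only rule: a sentence ends at
// '.', '!' or '?' when followed by whitespace or the end of the text.
std::size_t countSentencesNaive(const std::string& text) {
    std::size_t count = 0;
    bool inSentence = false;  // true once non-space text follows a terminator
    for (std::size_t i = 0; i < text.size(); ++i) {
        unsigned char c = static_cast<unsigned char>(text[i]);
        if (!std::isspace(c)) {
            inSentence = true;
        }
        bool terminator = (c == '.' || c == '!' || c == '?');
        bool atEnd = (i + 1 == text.size());
        bool followedBySpace =
            !atEnd && std::isspace(static_cast<unsigned char>(text[i + 1]));
        if (terminator && inSentence && (atEnd || followedBySpace)) {
            ++count;
            inSentence = false;
        }
    }
    if (inSentence) {
        ++count;  // trailing text that lacks a terminator
    }
    return count;
}
```

Applied to the example text above, this rule counts two sentences, because the periods in 3.2 and 3.1 are followed by digits rather than whitespace. It miscounts "Dr. Smith arrived." as two sentences, which is the kind of case a locale-sensitive break search is meant to handle better.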

Note that for all types of break searches, breaks often occur both before and after each unit being queried. For example, there are a total of four character breaks in the string abc. There is a break before the a, before the b, before the c, and after the c. This may require special handling of the ends of strings. For example, consider the following loop:

RWUString str;
RWUBreakSearch searcher(RWUBreakSearch::Character, str);
RWUConstStringIterator it;
for (it = searcher.first();
     it != str.endCodePointIterator();
     it = searcher.next())
{...}

If the character break that is located at the str.endCodePointIterator() position (like the break after the c above) should be processed, then you must take care to process it outside the body of the loop.
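The end-of-string handling described above can be sketched with a toy stand-in for the searcher. ToyCharacterBreaks is hypothetical: it treats every offset of an ASCII string as a character break and uses plain offsets rather than RWUConstStringIterator. It mimics only the first()/next() shape of the loop and assumes that next() stays at the end once the end is reached.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a break searcher: every offset of an ASCII
// string is treated as a character break.
class ToyCharacterBreaks {
public:
    explicit ToyCharacterBreaks(const std::string& text)
        : end_(text.size()), pos_(0) {}

    // Resets to the first break and returns its offset.
    std::size_t first() { pos_ = 0; return pos_; }

    // Advances to the next break; stays at the end once it is reached.
    std::size_t next() {
        if (pos_ < end_) {
            ++pos_;
        }
        return pos_;
    }

private:
    std::size_t end_;
    std::size_t pos_;
};

// Collects every break, including the one at the end of the text.
std::vector<std::size_t> allBreaks(const std::string& text) {
    ToyCharacterBreaks searcher(text);
    std::vector<std::size_t> seen;
    std::size_t it;
    for (it = searcher.first(); it != text.size(); it = searcher.next()) {
        seen.push_back(it);  // breaks that precede a character
    }
    seen.push_back(it);      // the final break, handled outside the loop
    return seen;
}
```

For the string abc, the loop body sees the three breaks before a, b, and c, and the statement after the loop picks up the fourth break after c.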

Direct Queries

RWUBreakSearch supports direct boundary queries using the isBreak() method. This method returns true if a given string position is a break. For example, this code tests whether there is a sentence break immediately to the left of the 12th code point in str:

RWUString str;
RWUBreakSearch searcher(RWUBreakSearch::Sentence, str);
RWUConstStringIterator it = str.beginCodePointIterator();
it.advanceCodePoints(11);
if (searcher.isBreak(it)) {
    // Do something here...
}
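The idea of a direct query, asking whether a break lies immediately to the left of a given position, can also be shown with a library-free sketch. The functions inWord and isWordBreakNaive are hypothetical. They apply a toy ASCII-only word rule: letters and digits are part of a word, as is a hyphen or apostrophe flanked by letters or digits. This is an assumption made for illustration, not the rule set RWUBreakSearch applies.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// True if text[i] counts as part of a word under this toy rule: letters and
// digits, plus a hyphen or apostrophe flanked by letters or digits.
bool inWord(const std::string& text, std::size_t i) {
    unsigned char c = static_cast<unsigned char>(text[i]);
    if (std::isalnum(c)) {
        return true;
    }
    if (c != '-' && c != '\'') {
        return false;
    }
    return i > 0 && i + 1 < text.size()
        && std::isalnum(static_cast<unsigned char>(text[i - 1]))
        && std::isalnum(static_cast<unsigned char>(text[i + 1]));
}

// Direct query: is there a word break immediately to the left of offset pos?
bool isWordBreakNaive(const std::string& text, std::size_t pos) {
    assert(pos <= text.size());
    if (pos == 0 || pos == text.size()) {
        return true;  // both ends of the text are breaks
    }
    // No break only when the characters on both sides belong to a word.
    return !(inWord(text, pos - 1) && inWord(text, pos));
}
```

With the text "don't stop", the query reports no break on either side of the apostrophe, and reports breaks before and after the space.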