Boundary Analysis
RWUBreakSearch finds the locations of code point, character, word, sentence, and line breaks in text.
• Code point breaks occur before and after each code point.
• Character breaks occur between characters, as defined from the end user's perspective. For example, an accented character can be represented by a single code point or by a pair of code points (one for the base character and another for the accent symbol). Character breaks occur on either side of this logical character, regardless of the number of code points used to represent it. An RWUBreakSearch that searches for character breaks can be used to iterate over the logical characters in a string.
• Word breaks occur before and after each word. They do not occur before and after punctuation contained within a word (such as a hyphen or an apostrophe), but they do occur before and after characters that are not part of a word, such as symbols and other punctuation marks. Note that, in some languages, words are not necessarily surrounded by whitespace. An RWUBreakSearch that searches for word breaks is useful for building operations that find whole words.
• Sentence breaks occur between sentences. RWUBreakSearch attempts to correctly interpret nested quotes, nested parentheses, and periods that may either end a sentence or be part of a number or abbreviation. This is a difficult task, however, and the results may not always be perfect. An RWUBreakSearch that searches for sentence breaks could be used to count the sentences in a string.
• Line breaks occur at positions where it would be appropriate to wrap text from one display line to the next. An RWUBreakSearch that searches for line breaks is useful for building line-wrapping algorithms.
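The difference between code point breaks and character breaks can be illustrated with a minimal, self-contained sketch. It does not use RWUBreakSearch or reproduce its algorithm. The function names codePointBreaks, characterBreaks, and isCombiningMark are hypothetical, and the only rule modeled is that a code point in the Combining Diacritical Marks block (U+0300 to U+036F) attaches to the preceding base character. Real character segmentation handles many more cases.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Simplified assumption: only code points in the Combining Diacritical
// Marks block (U+0300..U+036F) are treated as attaching to the preceding
// base character. Real segmentation rules cover far more cases.
bool isCombiningMark(char32_t cp) {
    return cp >= 0x0300 && cp <= 0x036F;
}

// Code point breaks: one before and after every code point, that is,
// every index from 0 to text.size() inclusive.
std::vector<std::size_t> codePointBreaks(const std::u32string& text) {
    std::vector<std::size_t> breaks;
    for (std::size_t i = 0; i <= text.size(); ++i) {
        breaks.push_back(i);
    }
    return breaks;
}

// Character breaks (simplified): like code point breaks, except that no
// break is reported between a code point and a combining mark that
// follows it.
std::vector<std::size_t> characterBreaks(const std::u32string& text) {
    std::vector<std::size_t> breaks;
    for (std::size_t i = 0; i <= text.size(); ++i) {
        const bool interior = (i > 0 && i < text.size());
        if (interior && isCombiningMark(text[i])) {
            continue;  // keep the mark together with its base character
        }
        breaks.push_back(i);
    }
    return breaks;
}
```

For the sequence e, U+0301, a, this sketch reports code point breaks at indexes 0, 1, 2, and 3, but character breaks only at 0, 2, and 3, because the accent stays with its base character.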
Instances of RWUBreakSearch are used by other classes in the Internationalization Module to find breaks in text in a locale-sensitive manner. For example, RWUStringSearch performs flexible, collation-based string searches, using the rules encapsulated by an RWUCollator and an optional RWUBreakSearch to determine if and where a match occurs (see “Locale-Sensitive String Searching”). Similarly, RWURegularExpression uses an RWUBreakSearch internally to find break-related matches (see “Regular Expression String Searching”).
Creating an RWUBreakSearch
RWUBreakSearch objects are created given:
• the type of boundary to be analyzed. The BreakType enum specifies the type of boundary. The enumerated values are CodePoint, Line, Sentence, Word, and Character.
• an RWUString that provides the text to process
• (optional) a locale name. If no locale is specified, the current default locale is used.
For example, this code creates an RWUBreakSearch that can be used to search the RWUString myString for character breaks based on the current default locale:
RWUBreakSearch searcher(RWUBreakSearch::Character, myString);
Using an RWUBreakSearch
Once a break search is instantiated, breaks can be queried using the first(), last(), next(), and previous() methods. An RWUBreakSearch object maintains a current position. Initially, the current position is the start of the source string. Calls to first(), last(), next(), and previous() alter the current position.
NOTE >> Breaks are interpreted as being between characters, immediately to the left of the current position.
For example, the following code counts the number of sentences in a string:
RWUConversionContext context("UTF-8");   // Default encoding for char* conversions
RWUString str("Unicode 3.2 is a minor version of the "
              "Unicode Standard. It overrides certain features of "
              "Unicode 3.1, and adds a significant number of coded "
              "characters.");
RWUBreakSearch searcher(RWUBreakSearch::Sentence, str);      // Search str for sentence breaks
RWUConstStringIterator iter = str.beginCodePointIterator();  // Start of the string
RWUConstStringIterator end = str.endCodePointIterator();     // End of the string
int count = 0;
while (iter != end) {
    ++count;                  // Each break before the end starts a sentence
    iter = searcher.next();   // Advance to the next sentence break
}
std::cout << "Found " << count << " sentences." << std::endl;
Note that for all types of break searches, breaks often occur both before and after each unit being queried. For example, there are a total of four character breaks in the string abc. There is a break before the a, before the b, before the c, and after the c. This may require special handling of the ends of strings. For example, consider the following loop:
RWUString str;
RWUBreakSearch searcher(RWUBreakSearch::Character, str);
RWUConstStringIterator it;
for (it = searcher.first();
     it != str.endCodePointIterator();
     it = searcher.next())
{...}
If the character break that is located at the str.endCodePointIterator() position (like the break after the c above) should be processed, then you must take care to process it outside the body of the loop.
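The end-of-string handling described above can be sketched without the library. ToyBreakCursor is a hypothetical stand-in, not the RWUBreakSearch API: it walks a precomputed list of break positions and, as an assumption for this sketch, keeps returning the end position once the last break has been reached. The helper visitAllBreaks is likewise illustrative only.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for a break searcher. It iterates over a sorted,
// non-empty list of break positions whose last entry is the end of text.
class ToyBreakCursor {
public:
    explicit ToyBreakCursor(std::vector<std::size_t> breaks)
        : breaks_(std::move(breaks)), index_(0) {}

    // Moves to the first break and returns its position.
    std::size_t first() {
        index_ = 0;
        return breaks_[index_];
    }

    // Moves to the next break, staying on the last one once it is reached.
    std::size_t next() {
        if (index_ + 1 < breaks_.size()) {
            ++index_;
        }
        return breaks_[index_];
    }

    // Position of the break at the end of the text.
    std::size_t end() const { return breaks_.back(); }

private:
    std::vector<std::size_t> breaks_;
    std::size_t index_;
};

// Visits every break, including the one at the end-of-text position.
std::vector<std::size_t> visitAllBreaks(ToyBreakCursor& cursor) {
    std::vector<std::size_t> visited;
    std::size_t pos;
    for (pos = cursor.first(); pos != cursor.end(); pos = cursor.next()) {
        visited.push_back(pos);  // breaks located before each unit
    }
    visited.push_back(pos);      // final break, processed outside the loop
    return visited;
}
```

With the break positions 0, 1, 2, and 3 (the four character breaks of abc), the loop body sees 0, 1, and 2, and the statement after the loop handles 3.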
Direct Queries
RWUBreakSearch supports direct boundary queries using the isBreak() method. This method returns true if a given string position is a break. For example, this code tests whether there is a sentence break immediately to the left of the 12th code point in str:
RWUString str;
RWUBreakSearch searcher(RWUBreakSearch::Sentence, str);
RWUConstStringIterator it = str.beginCodePointIterator();
it.advanceCodePoints(11);
if (searcher.isBreak(it)) {
    // Do something here...
}
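To show what a direct boundary query involves at the simplest level, here is a self-contained sketch that answers the question for code point breaks in UTF-8 text. It is not how isBreak() is implemented. The function name isCodePointBreak is hypothetical, positions are byte offsets rather than iterators, and the input is assumed to be well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>

// Returns true if byte offset pos lies on a code point boundary of the
// given UTF-8 text. Assumes the text is well-formed UTF-8.
bool isCodePointBreak(const std::string& utf8, std::size_t pos) {
    if (pos > utf8.size()) {
        return false;  // position is outside the text
    }
    if (pos == 0 || pos == utf8.size()) {
        return true;   // start and end of text are always breaks
    }
    const unsigned char byte = static_cast<unsigned char>(utf8[pos]);
    // Continuation bytes have the bit pattern 10xxxxxx. A break lies to
    // the left of any byte that is not a continuation byte.
    return (byte & 0xC0) != 0x80;
}
```

As with the isBreak() example, the query asks whether a break lies immediately to the left of the given position. A position in the middle of a multi-byte sequence is not a break.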