Locale-Sensitive String Searching
RWUStringSearch allows for flexible, collator-based string searches, unlike the string searches performed by RWUString (see “Lexical String Searching”). RWUString uses simple lexical comparisons of the code units in the strings, whereas RWUStringSearch employs the rules encapsulated by an RWUCollator and an optional RWUBreakSearch to determine whether and where a match occurs.
RWUStringSearch provides a number of options to search for occurrences of the pattern string in a text string:
• iteration-style searches using the first(), last(), next(), and previous() methods
• direct queries related to an iterator offset using the isMatch() method
• search and replace functionality using the replace() method
Creating an RWUStringSearch
RWUStringSearch objects are created given:
• an RWUString that specifies the pattern to search for
• an RWUString that specifies the text to search
• an RWUCollator that encapsulates locale-sensitive string comparison rules (see Chapter 6)
If an RWUBreakSearch is used, a substring is considered a match only if it falls on boundaries returned by the break search object. This makes it possible, for example, to search for entire words or entire sentences.
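The boundary rule can be illustrated with a minimal sketch that uses only the C++ standard library. It is an analogy, not the RWUBreakSearch implementation: the helper names isWordBoundary() and findWholeWords() are hypothetical, and a "word boundary" is assumed to be any place where an ASCII alphanumeric character meets a non-alphanumeric one.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: true if offset i is a word boundary in s, meaning i is
// at either end of the string or the characters on each side of i differ in
// whether they are ASCII alphanumeric.
bool isWordBoundary(const std::string& s, std::size_t i) {
    if (i == 0 || i == s.size()) return true;
    bool left = std::isalnum(static_cast<unsigned char>(s[i - 1])) != 0;
    bool right = std::isalnum(static_cast<unsigned char>(s[i])) != 0;
    return left != right;
}

// Hypothetical helper: returns the offsets of every occurrence of pattern
// that both starts and ends on a word boundary.
std::vector<std::size_t> findWholeWords(const std::string& text,
                                        const std::string& pattern) {
    std::vector<std::size_t> hits;
    if (pattern.empty()) return hits;
    for (std::size_t pos = text.find(pattern); pos != std::string::npos;
         pos = text.find(pattern, pos + 1)) {
        // Keep the candidate only if both of its ends fall on boundaries.
        if (isWordBoundary(text, pos) &&
            isWordBoundary(text, pos + pattern.size())) {
            hits.push_back(pos);
        }
    }
    return hits;
}
```

With this sketch, searching "one to four, someone" for "one" keeps the occurrence at the start of the text and rejects the one embedded in "someone", because the embedded occurrence does not begin on a boundary.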
For example, this code creates an RWUStringSearch that can be used to search the RWUString text for occurrences of RWUString pattern using the string comparison rules encapsulated by RWUCollator collator:
// Narrow string literals below are converted to RWUString as UTF-8.
RWUConversionContext context("UTF-8");
// The pattern to search for.
RWUString pattern("UTF-8");
RWUString text("Utf8 serializes a Unicode code point "
               "as a sequence of one to four bytes. Table 3-1 of "
               "The Unicode Standard shows the bit distribution used "
               "in utf-8.");
// A collator that compares at primary strength with punctuation shifting on.
RWUCollator collator;
collator.setStrength(RWUCollator::Primary);
collator.enablePunctuationShifting(true);
RWUStringSearch searcher(pattern, text, collator);
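To see why the collator settings matter for this text, the following sketch imitates a loose comparison with the C++ standard library only. It is a rough stand-in under stated assumptions, not how RWUCollator works: foldPrimary() and findLoose() are hypothetical names, case is folded for ASCII only, and all ASCII punctuation is simply dropped, which approximates what primary strength with punctuation shifting is generally understood to do.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a primary-strength key: ASCII letters are
// lowercased and ASCII punctuation is dropped. offsets[i] records where the
// i-th key character came from in the original string.
struct Folded {
    std::string key;
    std::vector<std::size_t> offsets;
};

Folded foldPrimary(const std::string& s) {
    Folded f;
    for (std::size_t i = 0; i < s.size(); ++i) {
        unsigned char c = static_cast<unsigned char>(s[i]);
        if (std::ispunct(c)) continue;  // punctuation is ignored
        f.key.push_back(static_cast<char>(std::tolower(c)));
        f.offsets.push_back(i);
    }
    return f;
}

// Hypothetical helper: returns the offsets in text where a loose match for
// pattern begins.
std::vector<std::size_t> findLoose(const std::string& text,
                                   const std::string& pattern) {
    std::vector<std::size_t> hits;
    Folded t = foldPrimary(text);
    Folded p = foldPrimary(pattern);
    if (p.key.empty()) return hits;
    for (std::size_t pos = t.key.find(p.key); pos != std::string::npos;
         pos = t.key.find(p.key, pos + p.key.size())) {
        hits.push_back(t.offsets[pos]);  // map back to the original text
    }
    return hits;
}
```

Applied to the sample text above, an exact std::string::find for "UTF-8" finds nothing, while findLoose() reports the "Utf8" at the start and the "utf-8" at the end.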
Iteration-Style Searches
In iteration-style searches, RWUStringSearch, like RWUBreakSearch, maintains a “current” position within the source string. Immediately after construction, the current position has no meaning. A call to first() or last() sets the current position to the code unit offset just past that of the first or last match, and returns the location of the beginning of the match. The next() method advances the current position to the code unit offset immediately following that of the next match, and returns the location of the beginning of the match. The previous() method moves the current position to the beginning of the previous match, and returns the same location.
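The relationship between the returned location and the current position for first() and next() can be sketched with a toy class built on std::string. This is only an illustration under stated assumptions: ToySearch is a hypothetical name, it matches code units exactly rather than through a collator, it reports offsets instead of iterators, and it signals "no match" with std::string::npos.

```cpp
#include <cstddef>
#include <string>

// Hypothetical toy searcher: exact matching over std::string, forward only.
class ToySearch {
public:
    ToySearch(const std::string& pattern, const std::string& text)
        : pattern_(pattern), text_(text), position_(0) {}

    // Finds the first match, leaves the current position just past it, and
    // returns the offset where the match begins (npos if there is none).
    std::size_t first() {
        position_ = 0;
        return next();
    }

    // Finds the next match at or after the current position, moves the
    // current position just past it, and returns where the match begins.
    std::size_t next() {
        if (pattern_.empty()) return std::string::npos;
        std::size_t start = text_.find(pattern_, position_);
        if (start == std::string::npos) {
            position_ = text_.size();  // nothing further to find
            return std::string::npos;
        }
        position_ = start + pattern_.size();
        return start;
    }

    // The current position, as a code unit offset into the text.
    std::size_t position() const { return position_; }

private:
    std::string pattern_;
    std::string text_;
    std::size_t position_;
};
```

For the pattern "ab" in "abXabXab", first() returns 0 and leaves the position at 2; the following calls to next() return 3 and then 6, leaving the position at 5 and then 8, after which next() reports no further match.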
For example, this code counts the number of occurrences of pattern in text:
RWUStringSearch searcher(pattern, text, collator);
int count = 0;
while (searcher.next() != text.endCodePointIterator()) ++count;
std::cout << "Pattern was found " << count << " times."
<< std::endl;
Direct Queries
RWUStringSearch supports direct match queries using the isMatch() method. This method returns true if the position designated by a given iterator into the text string begins a match for the pattern string.
For example, this code tests whether a match starts five code points past the beginning of text:
RWUStringSearch searcher(pattern, text, collator);
RWUConstStringIterator iter =
text.beginCodePointIterator().advanceCodePoints(5);
if (searcher.isMatch(iter)) {
// Do something here...
}
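Because the query position is expressed in code points while storage is in code units, the two offsets can differ. The sketch below shows the idea with the C++ standard library only. It is an analogy rather than the RWUString mechanism: it assumes the text is a well-formed UTF-8 std::string, it matches bytes exactly instead of using a collator, and advanceCodePoints() and isMatchAt() here are hypothetical free functions.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns the byte offset reached after advancing n code
// points from the start of a UTF-8 string, stopping at the end of the string.
std::size_t advanceCodePoints(const std::string& utf8, std::size_t n) {
    std::size_t i = 0;
    while (n > 0 && i < utf8.size()) {
        ++i;  // step over the lead byte
        // Step over continuation bytes, which have the bit pattern 10xxxxxx.
        while (i < utf8.size() &&
               (static_cast<unsigned char>(utf8[i]) & 0xC0) == 0x80) {
            ++i;
        }
        --n;
    }
    return i;
}

// Hypothetical helper: true if pattern occurs in text starting exactly at
// the given byte offset.
bool isMatchAt(const std::string& text, std::size_t offset,
               const std::string& pattern) {
    return !pattern.empty() && offset <= text.size() &&
           text.compare(offset, pattern.size(), pattern) == 0;
}
```

In the UTF-8 string "caf\xC3\xA9 au lait", advancing five code points lands on byte offset 6, because the accented letter occupies two bytes; a query for "au" at that position succeeds, while the same query one code point earlier does not.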
Search and Replace
RWUStringSearch supports search and replace functionality using the replace() method. This method searches a given RWUString for matches with the pattern stored in the RWUStringSearch object. Each match is replaced with a given replacement RWUString, up to a specified number of occurrences. The default number of occurrences to replace is 1. To replace all occurrences of the pattern, specify 0 occurrences. The method returns the number of occurrences actually replaced.
For example, this code replaces all occurrences of pattern in text2 with replacement:
RWUStringSearch searcher(pattern, text, collator);
RWUString replacement("UTF-16");
int32_t count = searcher.replace(text2, replacement, 0);
std::cout << "Replaced " << count << " occurrences." << std::endl;
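The occurrence-count convention (a default of 1, with 0 meaning "all") can be sketched with the C++ standard library. This is an illustration of the convention only, not the RWUStringSearch::replace() implementation: replaceOccurrences() is a hypothetical function, and it matches exactly rather than through a collator.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: replaces up to maxOccurrences matches of pattern in
// text with replacement. A maxOccurrences of 0 replaces every match.
// Returns the number of matches actually replaced.
int replaceOccurrences(std::string& text, const std::string& pattern,
                       const std::string& replacement,
                       int maxOccurrences = 1) {
    if (pattern.empty()) return 0;
    int replaced = 0;
    std::size_t pos = 0;
    while ((maxOccurrences == 0 || replaced < maxOccurrences) &&
           (pos = text.find(pattern, pos)) != std::string::npos) {
        text.replace(pos, pattern.size(), replacement);
        pos += replacement.size();  // resume after the inserted text
        ++replaced;
    }
    return replaced;
}
```

Resuming the scan after the inserted text keeps a replacement that itself contains the pattern from being matched again, so the loop always terminates.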