Module: Internationalization Module
Group: Unicode String Processing
Does Not Inherit
#include <rw/i18n/RWUCollator.h>
RWUCollator performs locale-sensitive string comparison for use in searching and sorting natural language text.
Each language has its own rules for determining the proper collation order for strings. For example, in Lithuanian, the letter y appears between i and k in the alphabet. In order to take language-specific conventions into account, each RWUCollator is associated with an RWULocale at construction time. This locale specifies the default values for a variety of RWUCollator attributes. Many of these default values can be overridden using attribute mutator methods.
RWUCollator follows the Unicode Collation Algorithm, as described in Unicode Technical Standard #10:
http://www.unicode.org/unicode/reports/tr10/.
This collation algorithm can be customized using the attribute mutator methods of the RWUCollator class. With these methods, you can specify how collation elements are found, how collation weights are formed, and which collation levels should be considered significant. See the Internationalization Module User's Guide for more information on collation.
RWUCollator calculates collation weights incrementally. This ensures good performance, as most strings differ in their first few characters. However, if string comparisons are to be made repeatedly (for example, when sorting a set of strings), then best performance can be achieved by obtaining an RWUCollationKey for each string and comparing the keys. Generating a key via RWUCollator::getCollationKey() is a non-trivial operation, as it involves determining the collation elements and weights for an entire string. Comparing two RWUCollationKey objects, however, is fast.
#include <rw/i18n/RWUCollator.h>
#include <rw/i18n/RWUConversionContext.h>
#include <iostream>

using std::cout;
using std::endl;

int
main()
{
    // Indicate string literals are encoded according to
    // ISO-8859-1.
    RWUConversionContext context("ISO-8859-1");

    // Use implicit conversion to build two strings.
    RWUString string1("Blackbird");
    RWUString string2("black-bird");

    // Create a collator based on the "en" locale.
    RWUCollator collator("en");

    // Modify the collator so it ignores differences
    // in punctuation and case.
    collator.enablePunctuationShifting(true);
    collator.setStrength(RWUCollator::Secondary);

    // Compare the two strings.
    int retval = collator.compareTo(string1, string2);

    if (retval < 0) {
        cout << "string1 is less than string2" << endl;
    }
    else if (retval == 0) {
        cout << "string1 is equal to string2" << endl;
    }
    else {
        cout << "string1 is greater than string2" << endl;
    } // else

    return 0;
} // main

Results:
========
string1 is equal to string2
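The advice about sorting with keys can be illustrated without the library. The following is a minimal sketch using only the C++ standard library: makeToyKey() is a hypothetical stand-in for an expensive key generator (it merely lowercases ASCII text and is not RWUCollator::getCollationKey()), and sortByCachedKey() is likewise an invented helper. The point is only the pattern: build each key once, then let the sort compare cached keys.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for an expensive key generator. It only
// lowercases ASCII characters; a real collation key encodes the
// collation weights of the whole string.
std::string makeToyKey(const std::string& text) {
    std::string key;
    for (unsigned char ch : text) {
        key.push_back(static_cast<char>(std::tolower(ch)));
    }
    return key;
}

// Sorts the strings by computing each key exactly once and then
// comparing only the cached keys during the sort.
std::vector<std::string> sortByCachedKey(const std::vector<std::string>& input) {
    typedef std::pair<std::string, std::string> KeyedString;  // (key, original)

    std::vector<KeyedString> keyed;
    keyed.reserve(input.size());
    for (const std::string& text : input) {
        keyed.push_back(KeyedString(makeToyKey(text), text));
    }

    // stable_sort keeps the input order of strings whose keys are equal.
    std::stable_sort(keyed.begin(), keyed.end(),
                     [](const KeyedString& a, const KeyedString& b) {
                         return a.first < b.first;
                     });

    std::vector<std::string> sorted;
    sorted.reserve(keyed.size());
    for (const KeyedString& entry : keyed) {
        sorted.push_back(entry.second);
    }
    return sorted;
}
```

With N strings, a comparison sort performs on the order of N log N comparisons but this pattern generates only N keys, which is why key generation cost is paid back when the same strings are compared repeatedly.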
RWUCollationKey, RWUNormalizer
enum CaseOrder { Normal, LowerFirst, UpperFirst };
A CaseOrder value determines how characters are ordered at the tertiary level or, if enabled, the case level.
In Normal case order, characters are ordered in accordance with the Unicode Collation Charts. Typically, the lowercase version of a letter is ordered before all other versions.
In LowerFirst case order, lowercase letters, small kana, and uncased characters are ordered before mixed-case letters. Uppercase letters are ordered last.
In UpperFirst case order, uppercase letters are ordered before mixed-case letters. Lowercase letters, small kana, and uncased characters are ordered last.
enum CollationStrength { Primary, Secondary, Tertiary, Quaternary, Identical };
A CollationStrength value indicates the level at which two collation elements should be considered equal.
At the Primary level, only primary differences are considered significant. Primary differences are locale-dependent, but are typically differences in basic character identity. An example of a primary difference is a != b.
At the Secondary level, both primary and secondary differences are considered significant. Secondary differences are locale-dependent, but are typically differences in diacritics. An example of a secondary difference is a != á.
At the Tertiary level, primary, secondary, and tertiary differences are considered significant. Tertiary differences are locale-dependent, but are typically differences in appearance, such as the differences between uppercase, lowercase, superscript, subscript, halfwidth, and circled versions of a character. An example of a tertiary difference is a != A.
At the Quaternary level, primary, secondary, tertiary, and quaternary differences are considered significant. Quaternary strength is useful only in two situations:
When punctuation shifting is enabled, whitespace and punctuation characters are ignored at the first three strength levels, and are distinguished at the quaternary level.
For Japanese locales, hiragana characters are positioned before katakana characters at the quaternary level, mimicking JIS sort order.
At the Identical level, all differences are considered significant. This strength level should be used sparingly. It rarely distinguishes between strings considered equal at the quaternary level, yet exacts a significant performance cost.
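The way strength limits which differences count can be modeled with a small, self-contained sketch. Everything here is an assumption made for illustration: ToyElement, ToyStrength, weightAt(), and toyCompare() are hypothetical names, the weights are invented, and only the first three levels are modeled. It is not how RWUCollator is implemented; it only shows levels being consulted in order, up to the chosen strength.

```cpp
#include <cstddef>
#include <vector>

// Toy collation element with one weight per level. Real weights come
// from locale data; any values used with this sketch are made up.
struct ToyElement {
    int primary;    // base character identity
    int secondary;  // diacritics
    int tertiary;   // case and similar variants
};

enum ToyStrength { ToyPrimary = 1, ToySecondary = 2, ToyTertiary = 3 };

// Returns the weight of an element at the given level (1, 2, or 3).
int weightAt(const ToyElement& element, int level) {
    if (level == 1) return element.primary;
    if (level == 2) return element.secondary;
    return element.tertiary;
}

// Compares two element sequences one level at a time, stopping at the
// requested strength. A difference at a lower level anywhere in the
// sequence outranks any difference at a higher level.
// Returns -1, 0, or 1.
int toyCompare(const std::vector<ToyElement>& lhs,
               const std::vector<ToyElement>& rhs,
               ToyStrength strength) {
    const std::size_t common = lhs.size() < rhs.size() ? lhs.size() : rhs.size();
    for (int level = 1; level <= static_cast<int>(strength); ++level) {
        for (std::size_t i = 0; i < common; ++i) {
            const int l = weightAt(lhs[i], level);
            const int r = weightAt(rhs[i], level);
            if (l != r) return l < r ? -1 : 1;
        }
        // A proper prefix sorts before the longer sequence.
        if (lhs.size() != rhs.size()) return lhs.size() < rhs.size() ? -1 : 1;
    }
    return 0;
}
```

For instance, if a is modeled as {1, 0, 0} and A as {1, 0, 1}, toyCompare() reports them equal at ToySecondary and unequal at ToyTertiary, mirroring the a != A example above.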
RWUCollator(const RWULocale& locale = RWULocale::getDefault());
Constructs a new RWUCollator based on the given locale. Throws RWUException if any error occurs during the construction.
RWUCollator(const RWUCollator &original);
Copy constructor. Makes self a deep copy of original. Throws RWUException if any error occurs during the construction.
~RWUCollator(void);
Destructor.
RWUCollator& operator=(const RWUCollator &rhs);
Assignment operator. Makes self a deep copy of rhs. Throws RWUException if any error occurs during the assignment.
int compareTo(const RWUString& lhs, const RWUString& rhs) const;
Compares the given strings, according to the dictates of this collator's attributes. Returns -1 if lhs < rhs, 0 if lhs == rhs, and 1 if lhs > rhs.
void enableCaseLevel(bool caseLevel);
Sets whether case distinctions should be made at an extra "case level," positioned between the secondary and tertiary levels:
If self's strength is Primary, base character identity is taken into consideration, then case distinctions are made. Diacritics are not taken into account.
If self's strength is Secondary, base character identity, diacritics, and case distinctions are taken into account, in that order. Other tertiary distinctions, such as those between regular and superscript versions of a character, are not taken into account.
If self's strength is Tertiary, base character identity, diacritics, case distinctions, and other tertiary distinctions are taken into account, in that order.
At the case level, cased characters are ordered according to self's CaseOrder attribute.
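The first case in the list above (Primary strength with the case level enabled) can be modeled with a short sketch. The names CasedElement, threeWay(), and comparePrimaryWithCase() are hypothetical, the weights are invented, and the code is a toy model rather than the library's algorithm. It shows base characters being compared first, case second, and diacritics never being consulted.

```cpp
#include <vector>

// Toy collation element; all weights are invented for illustration.
struct CasedElement {
    int primary;     // base character identity
    int accent;      // diacritic weight (secondary level)
    int caseWeight;  // 0 = lowercase, 1 = uppercase (case level)
};

// Three-way lexicographic comparison of two weight sequences.
// Returns -1, 0, or 1.
int threeWay(const std::vector<int>& lhs, const std::vector<int>& rhs) {
    if (lhs < rhs) return -1;
    if (rhs < lhs) return 1;
    return 0;
}

// Models Primary strength with the case level enabled: primary weights
// are compared first, then case weights. Accent weights are never read.
int comparePrimaryWithCase(const std::vector<CasedElement>& lhs,
                           const std::vector<CasedElement>& rhs) {
    std::vector<int> lhsPrimary, rhsPrimary, lhsCase, rhsCase;
    for (const CasedElement& element : lhs) {
        lhsPrimary.push_back(element.primary);
        lhsCase.push_back(element.caseWeight);
    }
    for (const CasedElement& element : rhs) {
        rhsPrimary.push_back(element.primary);
        rhsCase.push_back(element.caseWeight);
    }
    const int primaryResult = threeWay(lhsPrimary, rhsPrimary);
    if (primaryResult != 0) return primaryResult;
    return threeWay(lhsCase, rhsCase);
}
```

Under this model, "role" and "rôle" compare equal because the accent weight is skipped, while "role" and "Role" differ because the case weight of the first element differs.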
void enableFrenchCollation(bool frenchCollation);
Sets whether French collation rules should be in effect for self.
When French collation rules are in effect, the diacritical differences at the secondary strength level are compared in reverse order, from the end of each string to its start.
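The effect of reversing the secondary comparison can be seen in a small sketch. It assumes the strings being compared already agree at the primary level (for example cote, coté, and côte), so only their accent weights matter. The function compareSecondary() and the weight values are hypothetical and chosen purely for illustration.

```cpp
#include <algorithm>
#include <vector>

// Compares two sequences of secondary (accent) weights, one weight per
// character. A weight of 0 means "no accent"; nonzero values are
// made-up accent weights.
// When backwards is true, the sequences are scanned from the last
// character to the first, as under French collation rules.
// Returns -1, 0, or 1.
int compareSecondary(std::vector<int> lhs, std::vector<int> rhs, bool backwards) {
    if (backwards) {
        std::reverse(lhs.begin(), lhs.end());
        std::reverse(rhs.begin(), rhs.end());
    }
    // std::vector's operator< is a lexicographic comparison.
    if (lhs < rhs) return -1;
    if (rhs < lhs) return 1;
    return 0;
}
```

Modeling coté as {0, 0, 0, 1} and côte as {0, 2, 0, 0}, the forward comparison places coté first because the earliest accent difference is on the second letter, while the backward comparison places côte first because the last accent difference is examined first.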
void enableNormalizationChecking(bool check);
Sets whether self should perform normalization checks on input strings.
When normalization checking is disabled, self correctly compares strings that are in FCD (Fast C or D) form--that is, strings whose raw, recursive decomposition (without reordering of diacritics) results in a canonically-ordered string. Most strings in many languages are in FCD form.
In contrast, normalization checking is enabled by default for languages that use multiple combining characters, such as Arabic, Hebrew, Hindi, Thai, and Vietnamese. This ensures that input strings are normalized if necessary before collation. If, however, you know your strings are already in FCD form, you can improve performance slightly by disabling normalization checking.
void enablePunctuationShifting(bool shift);
Sets whether the significance of punctuation and whitespace characters should be shifted from the primary strength level to the quaternary strength level.
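As a rough illustration of what shifting means for a comparison below the quaternary level, the sketch below handles plain ASCII text only. The helpers toyComparisonForm() and toyEquals() are hypothetical; they ignore case (as a collator set to Secondary strength would) and drop punctuation and whitespace only when shifting is requested. This is a toy model, not the library's algorithm.

```cpp
#include <cctype>
#include <string>

// Builds a toy comparison form for ASCII text. Letters are lowercased
// so that case is ignored. Punctuation and whitespace are dropped only
// when shift is true; otherwise they remain significant.
std::string toyComparisonForm(const std::string& text, bool shift) {
    std::string form;
    for (unsigned char ch : text) {
        const bool ignorable = std::ispunct(ch) || std::isspace(ch);
        if (shift && ignorable) {
            continue;  // shifted characters do not take part
        }
        form.push_back(static_cast<char>(std::tolower(ch)));
    }
    return form;
}

// Returns true if the two strings have the same toy comparison form.
bool toyEquals(const std::string& lhs, const std::string& rhs, bool shift) {
    return toyComparisonForm(lhs, shift) == toyComparisonForm(rhs, shift);
}
```

In this model, toyEquals("Blackbird", "black-bird", true) is true and toyEquals("Blackbird", "black-bird", false) is false, because the hyphen counts only when shifting is off.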
bool equals(const RWUString& lhs, const RWUString& rhs) const;
Compares the given strings, according to the dictates of this collator's attributes. Returns true if lhs == rhs; otherwise, false.
CaseOrder getCaseOrder(void) const;
Returns the current CaseOrder for self.
RWUCollationKey getCollationKey(const RWUString& str) const;
Returns an RWUCollationKey corresponding to the given string str. This key may be compared to other keys produced by collators with the same attributes.
RWULocale getLocale(void) const;
Returns the locale associated with self.
CollationStrength getStrength(void) const;
Returns the CollationStrength associated with self.
bool isEnabledCaseLevel(void) const;
Returns true if the case level is enabled; otherwise, false.
bool isEnabledFrenchCollation(void) const;
Returns true if French collation rules are in effect; otherwise, false.
bool isEnabledNormalizationChecking(void) const;
Returns true if normalization checking is enabled; otherwise, false.
bool isEnabledPunctuationShifting(void) const;
Returns true if punctuation shifting is enabled; otherwise, false.
void setCaseOrder(CaseOrder order);
Sets the case ordering for self to order.
void setStrength(CollationStrength strength);
Sets the collation strength of self to strength.
© Copyright Rogue Wave Software, Inc. All Rights Reserved.
Rogue Wave and SourcePro are registered trademarks of Rogue Wave Software, Inc. in the United States and other countries. All other trademarks are the property of their respective owners.