Customizing a Collator

Internationalization Module User's Guide

6.3 Customizing a Collator

Class RWUCollator follows the Unicode Collation Algorithm, as described in Unicode Technical Standard #10:

http://www.unicode.org/unicode/reports/tr10/

Conceptually, this algorithm works as follows:

For each string, find its collation elements. A collation element usually, but not necessarily, corresponds to a character. A composed character such as á, for example, corresponds to multiple collation elements: one for the letter a and one for the acute accent symbol. In contrast, traditional Spanish regards ch as a single character, so in this locale two Unicode code points correspond to a single collation element.

For each collation element, find its collation weights. Each collation element has at least three, and sometimes four, weights. Each weight gives another level of collation information for that collation element. The exact meaning of a collation level depends upon locale. For most locales:

the primary weight encodes basic character identity
the secondary weight encodes diacritical information
the tertiary weight encodes differences in appearance, such as case

For example, a and A have the same primary weight and are considered identical at the primary level, while the primary weight for a is less than that of b so a < b. At the secondary level, the weights for a and á differ. At the tertiary level, the weights for a and A differ.

Compare the primary weights of two strings. If the strings can be distinguished at the primary level, the collation is complete and the result can be returned.

If the strings are identical at the primary level, continue comparing weights of additional levels as requested until a difference is found or the strings are determined to be equivalent.
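The level-by-level comparison described above can be sketched in a few lines of standard C++. This is a toy model, not RWUCollator's implementation: the weight table, the Element struct, and the functions elementsFor(), levelKey(), and compare() are all hypothetical names invented for illustration, and the weights are made-up small integers rather than values from the Unicode collation tables. A weight of 0 means "ignored at this level".

```cpp
#include <string>
#include <vector>

// Toy collation element with three weights (hypothetical values).
struct Element { int primary; int secondary; int tertiary; };

// Hypothetical strength levels for this sketch.
enum Strength { Primary = 1, Secondary = 2, Tertiary = 3 };

// Step 1 and 2: map a character to its collation elements and weights.
// The composed character U+00E1 yields two elements: the letter a
// and the acute accent, which has no primary weight.
std::vector<Element> elementsFor(char32_t c) {
    switch (c) {
        case U'a':      return {Element{1, 1, 1}};
        case U'A':      return {Element{1, 1, 2}};
        case U'\u00E1': return {Element{1, 1, 1}, Element{0, 2, 1}};
        case U'b':      return {Element{2, 1, 1}};
        case U'B':      return {Element{2, 1, 2}};
        default:        return {};  // unknown characters are ignored here
    }
}

// Collect the non-zero weights of one level (0 = primary, 1, 2) for a string.
std::vector<int> levelKey(const std::u32string& s, int level) {
    std::vector<int> key;
    for (char32_t c : s) {
        for (const Element& e : elementsFor(c)) {
            int w = level == 0 ? e.primary : level == 1 ? e.secondary : e.tertiary;
            if (w != 0) key.push_back(w);
        }
    }
    return key;
}

// Steps 3 and 4: compare level by level, stopping at the first difference
// or when the requested strength is exhausted.
int compare(const std::u32string& lhs, const std::u32string& rhs, Strength strength) {
    for (int level = 0; level < strength; ++level) {
        std::vector<int> l = levelKey(lhs, level);
        std::vector<int> r = levelKey(rhs, level);
        if (l < r) return -1;
        if (r < l) return 1;
    }
    return 0;
}
```

In this sketch, compare(U"a", U"A", Primary) returns 0 because the two letters share a primary weight, while compare(U"a", U"A", Tertiary) returns a negative value because the tertiary weights differ.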

RWUCollator provides a variety of mutator methods for customizing how collation is performed. With these methods, you can specify:

how collation elements are found
how collation weights are formed
which collation levels should be considered significant

6.3.1 Finding Collation Elements

The enableNormalizationChecking() method lets you modify the process by which RWUCollator obtains a series of collation elements from a string of Unicode characters. It controls whether RWUCollator normalizes a string before finding its collation elements.

Without normalization, RWUCollator can correctly collate strings that are in FCD ("Fast C or D") form. These are strings whose raw, recursive decomposition, without reordering of diacritics, results in an NFD string (Normalization Form D, canonical decomposition; see Chapter 5 for more information on normalization forms). Most strings in many languages are already in FCD form. In contrast, strings in languages that use multiple combining characters, such as Arabic, Hebrew, Hindi, Thai, and Vietnamese, might not be in FCD form.

When normalization checking is enabled, RWUCollator checks input strings and normalizes them if necessary. When normalization checking is disabled, it skips the normalization check, improving performance.

The default value for the normalization check attribute is based upon locale. For example, normalization checking is enabled by default for Thai. If you know, however, that your Thai input strings are in FCD form, you can increase performance by disabling normalization checking.

The isEnabledNormalizationChecking() method returns true if normalization checking is enabled; otherwise, false.

For more information on normalization, see Chapter 5.

6.3.2 Forming Collation Weights

RWUCollator provides methods for controlling the process of forming collation weights for each collation element.

6.3.2.1 Case Order

The setCaseOrder() method controls the relative order of cased letters. For most locales, the default value of this attribute is RWUCollator::Normal, indicating that tertiary weights should be taken directly from the Unicode Collation Charts; a lower-case letter is usually ordered before the upper-case, superscript, circled, or other versions of the same letter. For the Latvian locale, the default case order is RWUCollator::UpperFirst, causing all upper-case versions of a letter to be ordered before all lower-case versions.

The getCaseOrder() method returns the current case order associated with the collator.
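To illustrate the effect of case order on tertiary weights, here is a minimal standalone sketch for ASCII strings. It does not call RWUCollator; the CaseOrder enum merely mirrors the two values named above, and tertiaryWeight() and compare() are hypothetical helpers with made-up weights. Secondary weights are omitted for brevity.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical stand-in for the two case orders discussed in the text.
enum class CaseOrder { Normal, UpperFirst };

// Toy tertiary weight: the smaller weight sorts first.
int tertiaryWeight(unsigned char c, CaseOrder order) {
    bool upper = std::isupper(c) != 0;
    if (order == CaseOrder::UpperFirst) return upper ? 1 : 2;
    return upper ? 2 : 1;  // Normal: lower-case before upper-case
}

// Compare primary weights (case-folded letters) first, then tertiary weights.
int compare(const std::string& lhs, const std::string& rhs, CaseOrder order) {
    std::vector<int> lp, rp, lt, rt;
    for (unsigned char c : lhs) {
        lp.push_back(std::tolower(c));
        lt.push_back(tertiaryWeight(c, order));
    }
    for (unsigned char c : rhs) {
        rp.push_back(std::tolower(c));
        rt.push_back(tertiaryWeight(c, order));
    }
    if (lp != rp) return lp < rp ? -1 : 1;
    if (lt != rt) return lt < rt ? -1 : 1;
    return 0;
}
```

In this model, "apple" sorts before "Apple" under CaseOrder::Normal and after it under CaseOrder::UpperFirst. Because case is only a tertiary difference, "apple" still sorts before "Banana" under either setting.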

6.3.2.2 Punctuation Shifting

The enablePunctuationShifting() method causes whitespace and punctuation to be ignored at the primary, secondary, and tertiary levels, and to be considered significant only at the quaternary level. Punctuation shifting is disabled by default.

The isEnabledPunctuationShifting() method returns true if punctuation shifting is currently enabled; otherwise, false.
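The following standalone sketch shows, under simplifying assumptions, what shifting does to a comparison. It is not RWUCollator code: isVariable(), primaryKey(), quaternaryKey(), and compare() are hypothetical helpers, the weights are invented, only ASCII input is handled, and the secondary and tertiary levels are left out so that the primary and quaternary levels stand out.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Whitespace and punctuation are the "shiftable" characters in this sketch.
bool isVariable(unsigned char c) {
    return std::isspace(c) != 0 || std::ispunct(c) != 0;
}

// Primary key: with shifting, variable characters contribute nothing.
// Without shifting, they get a weight below every letter or digit.
std::vector<int> primaryKey(const std::string& s, bool shifted) {
    std::vector<int> key;
    for (unsigned char c : s) {
        if (isVariable(c)) {
            if (!shifted) key.push_back(c);
        } else {
            key.push_back(0x100 + std::tolower(c));
        }
    }
    return key;
}

// Quaternary key (shifted mode only): variable characters keep their
// weight; every other character gets one common high weight.
std::vector<int> quaternaryKey(const std::string& s) {
    std::vector<int> key;
    for (unsigned char c : s) key.push_back(isVariable(c) ? c : 0xFFFF);
    return key;
}

int compare(const std::string& lhs, const std::string& rhs,
            bool shifted, bool quaternary) {
    std::vector<int> lp = primaryKey(lhs, shifted);
    std::vector<int> rp = primaryKey(rhs, shifted);
    if (lp != rp) return lp < rp ? -1 : 1;
    if (shifted && quaternary) {
        std::vector<int> lq = quaternaryKey(lhs);
        std::vector<int> rq = quaternaryKey(rhs);
        if (lq != rq) return lq < rq ? -1 : 1;
    }
    return 0;
}
```

Without shifting, "de mark" sorts before "deluge" in this model because the space outranks every letter. With shifting, the space is ignored and "deluge" comes first. With shifting, "de luge" and "deluge" compare equal unless the quaternary level is also examined.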

6.3.3 Examining Collation Levels

RWUCollator provides methods for controlling how weights at the various collation levels are examined.

6.3.3.1 Collation Strength

The setStrength() method determines the number of collation strength levels taken into consideration by RWUCollator. For example, setting a collator's strength to RWUCollator::Primary causes it to ignore secondary and tertiary differences in collation weights, in effect ignoring diacritical and case differences.

Quaternary strength is useful only in two situations:

When punctuation shifting is enabled (Section 6.3.2.2), whitespace and punctuation characters are ignored at the first three strength levels, and are distinguished at the quaternary level.
For Japanese locales, hiragana characters are positioned before katakana characters at the quaternary level, mimicking JIS sort order.

The default strength level for most locales is tertiary; for Japanese, it is quaternary.

The getStrength() method returns the current collation strength associated with the collator.

6.3.3.2 Case Level

The enableCaseLevel() method creates an additional level of collation information, known as the case level, that distinguishes characters solely by case. Lower-case letters, small kana, and uncased characters are distinguished from upper-case letters, which are in turn distinguished from mixed-case digraphs such as the Croatian Lj. If the case level is enabled, case distinctions are made regardless of the collator's strength level. This behavior is disabled by default.

The isEnabledCaseLevel() method returns true if the case level is enabled; otherwise, false.
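A common use of the case level is to ignore accents while still distinguishing case, by combining primary strength with the case level. The sketch below models that combination in plain C++. It is an illustration only: Unit, Key, analyze(), makeKey(), order(), and compare() are hypothetical names, the strength is passed as a plain integer (1 for primary, 2 for secondary), and only ASCII letters plus U+00F4 (o with circumflex) are recognized.

```cpp
#include <string>
#include <vector>

// Toy decomposition of a character: base letter, accent flag, case flag.
struct Unit { char32_t base; int accent; int upper; };

Unit analyze(char32_t c) {
    if (c == U'\u00F4') return {U'o', 1, 0};  // o with circumflex
    if (c >= U'A' && c <= U'Z')
        return {static_cast<char32_t>(c - U'A' + U'a'), 0, 1};
    return {c, 0, 0};
}

struct Key {
    std::u32string primary;      // base letters
    std::vector<int> secondary;  // accent flags
    std::vector<int> caseLevel;  // case flags
};

Key makeKey(const std::u32string& s) {
    Key k;
    for (char32_t c : s) {
        Unit u = analyze(c);
        k.primary.push_back(u.base);
        k.secondary.push_back(u.accent);
        k.caseLevel.push_back(u.upper);
    }
    return k;
}

template <typename T>
int order(const T& l, const T& r) { return l < r ? -1 : (r < l ? 1 : 0); }

// The case level is consulted whenever it is enabled, whatever the strength.
int compare(const std::u32string& lhs, const std::u32string& rhs,
            int strength, bool caseLevelEnabled) {
    Key l = makeKey(lhs), r = makeKey(rhs);
    if (int d = order(l.primary, r.primary)) return d;
    if (strength >= 2) {
        if (int d = order(l.secondary, r.secondary)) return d;
    }
    if (caseLevelEnabled) {
        if (int d = order(l.caseLevel, r.caseLevel)) return d;
    }
    return 0;
}
```

At strength 1 with the case level enabled, this model treats "role" and "rôle" as equal but orders "role" before "Role".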

6.3.3.3 French Collation

The enableFrenchCollation() method determines the order in which secondary weights are compared. Normally, weights at all levels are compared from the start of the input strings to their ends. When French collation is enabled, secondary weights are compared in reverse order, from the end of the input strings to their beginnings, as is customary in French. French collation is enabled by default for French locales.

The isEnabledFrenchCollation() method returns true if French collation is enabled; otherwise, false.
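The effect of reversing the secondary comparison is easiest to see with the classic word pair coté and côte. The following standalone sketch, which does not use RWUCollator, compares base letters first and accent weights second, reading the accent weights either forward or backward. The names Key, decompose(), makeKey(), and compare(), and the accent weights themselves, are assumptions made for this illustration; only U+00E9 (e with acute) and U+00F4 (o with circumflex) are treated as accented.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// Split a character into a base letter and a toy accent weight
// (1 = no accent, 2 = acute, 3 = circumflex).
std::pair<char32_t, int> decompose(char32_t c) {
    switch (c) {
        case U'\u00E9': return {U'e', 2};  // e with acute
        case U'\u00F4': return {U'o', 3};  // o with circumflex
        default:        return {c, 1};
    }
}

struct Key {
    std::u32string primary;      // base letters
    std::vector<int> secondary;  // accent weights, in string order
};

Key makeKey(const std::u32string& s) {
    Key k;
    for (char32_t c : s) {
        std::pair<char32_t, int> d = decompose(c);
        k.primary.push_back(d.first);
        k.secondary.push_back(d.second);
    }
    return k;
}

int compare(const std::u32string& lhs, const std::u32string& rhs, bool french) {
    Key l = makeKey(lhs), r = makeKey(rhs);
    if (l.primary != r.primary) return l.primary < r.primary ? -1 : 1;
    if (french) {
        // French collation: examine accents from the end of the string.
        std::reverse(l.secondary.begin(), l.secondary.end());
        std::reverse(r.secondary.begin(), r.secondary.end());
    }
    if (l.secondary != r.secondary) return l.secondary < r.secondary ? -1 : 1;
    return 0;
}
```

With the forward comparison, this model yields cote, coté, côte, côté. With the reversed comparison, the last accent difference decides, giving cote, côte, coté, côté.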
