RWUCollator

Internationalization Module Reference Guide

Module: Internationalization Module Group: Unicode String Processing


Does Not Inherit

Local Index

Members

CaseOrder
CollationStrength
compareTo()
enableCaseLevel()
enableFrenchCollation()

enableNormalizationChecking()
enablePunctuationShifting()
equals()
getCaseOrder()
getCollationKey()

getLocale()
getStrength()
isEnabledCaseLevel()
isEnabledFrenchCollation()
isEnabledNormalizationChecking()

isEnabledPunctuationShifting()
operator=()
RWUCollator()
setCaseOrder()
setStrength()

Header File

#include <rw/i18n/RWUCollator.h>

Description

RWUCollator performs locale-sensitive string comparison for use in searching and sorting natural language text.

Each language has its own rules for determining the proper collation order for strings. For example, in Lithuanian, the letter y appears between i and k in the alphabet. In order to take language-specific conventions into account, each RWUCollator is associated with an RWULocale at construction time. This locale specifies the default values for a variety of RWUCollator attributes. Many of these default values can be overridden using attribute mutator methods.

RWUCollator follows the Unicode Collation Algorithm, as described in Unicode Technical Standard #10:

http://www.unicode.org/unicode/reports/tr10/

This collation algorithm can be customized using the attribute mutator methods of the RWUCollator class. With these methods, you can specify how collation elements are found, how collation weights are formed, and which collation levels should be considered significant. See the Internationalization Module User's Guide for more information on collation.

RWUCollator calculates collation weights incrementally. This ensures good performance, as most strings differ in their first few characters. However, if string comparisons are to be made repeatedly (for example, when sorting a set of strings), then best performance can be achieved by obtaining an RWUCollationKey for each string and comparing the keys. Generating a key via RWUCollator::getCollationKey() is a non-trivial operation, as it involves determining the collation elements and weights for an entire string. Comparing two RWUCollationKey objects, however, is fast.
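
The key-based approach described above can be sketched with the C++ standard library alone. This is a minimal illustration, not the RWUCollator API: std::collate::transform() is used as a stand-in for RWUCollator::getCollationKey(), and the function name sortWithKeys is hypothetical. The point it shows is that each string is transformed once, after which the sort compares only the precomputed keys.

```cpp
#include <algorithm>
#include <cstddef>
#include <locale>
#include <string>
#include <utility>
#include <vector>

// Sorts strings by computing one sort key per string up front and then
// comparing only the keys. std::collate::transform() stands in for
// RWUCollator::getCollationKey(); this is not the Rogue Wave API.
std::vector<std::string>
sortWithKeys(const std::vector<std::string>& input, const std::locale& loc)
{
  const std::collate<char>& coll = std::use_facet<std::collate<char> >(loc);

  // One transform() call per string: the expensive step happens once.
  std::vector<std::pair<std::string, std::string> > keyed;
  keyed.reserve(input.size());
  for (std::size_t i = 0; i < input.size(); ++i) {
    const std::string& s = input[i];
    std::string key = coll.transform(s.data(), s.data() + s.size());
    keyed.push_back(std::make_pair(key, s));
  }

  // Pairs compare by key first; the original string only breaks ties.
  std::sort(keyed.begin(), keyed.end());

  std::vector<std::string> result;
  result.reserve(keyed.size());
  for (std::size_t i = 0; i < keyed.size(); ++i) {
    result.push_back(keyed[i].second);
  }
  return result;
}
```

With RWUCollator, the same pattern would presumably pair each RWUString with the RWUCollationKey returned by getCollationKey() and sort on the keys.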

Example

#include <rw/i18n/RWUCollator.h>
#include <rw/i18n/RWUConversionContext.h>
#include <iostream>

using std::cout;
using std::endl;

int
main()
{
  // Indicate string literals are encoded according to
  // ISO-8859-1.
  RWUConversionContext context("ISO-8859-1");

  // Use implicit conversion to build two strings.
  RWUString string1("Blackbird");
  RWUString string2("black-bird");

  // Create a collator based on the "en" locale.
  RWUCollator collator("en");
  
  // Modify the collator so it ignores differences
  // in punctuation and case.
  collator.enablePunctuationShifting(true);
  collator.setStrength(RWUCollator::Secondary);
  
  // Compare the two strings.
  int retval = collator.compareTo(string1, string2);
  if (retval < 0) {
    cout << "string1 is less than string2" << endl;
  } else if (retval == 0) {
    cout << "string1 is equal to string2" << endl;
  } else {
    cout << "string1 is greater than string2" << endl;
  } // else

  return 0;
} // main

Results:
========

string1 is equal to string2

Public Typedefs

enum CaseOrder { Normal,
                 LowerFirst,
                 UpperFirst
};

A CaseOrder value determines how characters are ordered at the tertiary level or, if enabled, the case level.

In Normal case order, characters are ordered in accordance with the Unicode Collation Charts. Typically, the lowercase version of a letter is ordered before all other versions.

In LowerFirst case order, lowercase letters, small kana, and uncased characters are ordered before mixed-case letters. Uppercase letters are ordered last.

In UpperFirst case order, uppercase letters are ordered before mixed-case letters. Lowercase letters, small kana, and uncased characters are ordered last.

enum CollationStrength { Primary,
                         Secondary,
                         Tertiary,
                         Quaternary, 
                         Identical
};

A CollationStrength value indicates the level at which two collation elements should be considered equal.

At the Primary level, only primary differences are considered significant. Primary differences are locale-dependent, but are typically differences in basic character identity. An example of a primary difference is a != b.

At the Secondary level, both primary and secondary differences are considered significant. Secondary differences are locale-dependent, but are typically differences in diacritics. An example of a secondary difference is a != á.

At the Tertiary level, primary, secondary, and tertiary differences are considered significant. Tertiary differences are locale-dependent, but are typically differences in appearance, such as the differences between uppercase, lowercase, superscript, subscript, halfwidth, and circled versions of a character. An example of a tertiary difference is a != A.

At the Quaternary level, primary, secondary, tertiary, and quaternary differences are considered significant. Quaternary strength is useful only in two situations:

When punctuation shifting is enabled, whitespace and punctuation characters are ignored at the first three strength levels, and are distinguished at the quaternary level.
For Japanese locales, hiragana characters are positioned before katakana characters at the quaternary level, mimicking JIS sort order.

At the Identical level, all differences are considered significant. This strength level should be used sparingly. It rarely distinguishes between strings considered equal at the quaternary level, yet exacts a significant performance cost.
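
The effect of the first three strength levels can be illustrated with a toy model. The sketch below is an assumption-laden simplification, not how RWUCollator is implemented: the Element struct, the Strength enum, and the weight values are all hypothetical, and each character is assumed to map to exactly one collation element. It shows only the principle that a level is consulted when all stronger levels compare equal and the chosen strength includes it.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical toy model: one collation element per character, carrying
// three weights. Real weights come from locale data.
struct Element {
  int primary;    // base letter identity
  int secondary;  // diacritic
  int tertiary;   // case or other variant
};

enum Strength { Primary = 1, Secondary = 2, Tertiary = 3 };

typedef std::vector<Element> Word;

// Compares a single weight level across both words; returns -1, 0, or 1.
int compareLevel(const Word& lhs, const Word& rhs, int Element::*weight)
{
  std::size_t n = lhs.size() < rhs.size() ? lhs.size() : rhs.size();
  for (std::size_t i = 0; i < n; ++i) {
    if (lhs[i].*weight != rhs[i].*weight) {
      return lhs[i].*weight < rhs[i].*weight ? -1 : 1;
    }
  }
  if (lhs.size() == rhs.size()) return 0;
  return lhs.size() < rhs.size() ? -1 : 1;
}

// Consults each level only if all stronger levels were equal and the
// requested strength includes that level.
int compareAt(const Word& lhs, const Word& rhs, Strength strength)
{
  int result = compareLevel(lhs, rhs, &Element::primary);
  if (result == 0 && strength >= Secondary) {
    result = compareLevel(lhs, rhs, &Element::secondary);
  }
  if (result == 0 && strength >= Tertiary) {
    result = compareLevel(lhs, rhs, &Element::tertiary);
  }
  return result;
}
```

In this model, a and á share a primary weight and differ only in the secondary weight, so they compare equal at Primary and unequal at Secondary. Likewise, a and A differ only in the tertiary weight and are distinguished only at Tertiary.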

Public Constructors

RWUCollator(const RWULocale& locale = 
            RWULocale::getDefault());

Constructs a new RWUCollator based on the given locale. Throws RWUException if any error occurs during the construction.

RWUCollator(const RWUCollator &original);

Copy constructor. Makes self a deep copy of original. Throws RWUException if any error occurs during the construction.

Public Destructor

~RWUCollator(void);

Destructor.

Public Member Operators

RWUCollator&
operator=(const RWUCollator &rhs);

Assignment operator. Makes self a deep copy of rhs. Throws RWUException if any error occurs during the assignment.

Public Member Functions

int
compareTo(const RWUString& lhs, const RWUString& rhs) const;

Compares the given strings, according to the dictates of this collator's attributes. Returns -1 if lhs < rhs, 0 if lhs == rhs, and 1 if lhs > rhs.
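
Because compareTo() returns a three-way result, it cannot be passed directly to algorithms such as std::sort, which expect a "less than" predicate. The following sketch shows one possible adapter. CollatorLess, LengthCollator, and sortByLength are hypothetical names introduced for illustration; LengthCollator is a stand-in that orders strings by length so the example compiles without the library.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Adapts any object with a three-way compareTo(lhs, rhs) member into a
// strict "less than" predicate usable with std::sort.
template <class Collator, class String>
class CollatorLess {
public:
  explicit CollatorLess(const Collator& collator) : collator_(&collator) {}

  bool operator()(const String& lhs, const String& rhs) const {
    return collator_->compareTo(lhs, rhs) < 0;
  }

private:
  const Collator* collator_;  // not owned; must outlive the predicate
};

// Hypothetical stand-in collator: orders strings by length only.
struct LengthCollator {
  int compareTo(const std::string& lhs, const std::string& rhs) const {
    if (lhs.size() < rhs.size()) return -1;
    return lhs.size() > rhs.size() ? 1 : 0;
  }
};

// Sorts a copy of the given words using the adapter.
std::vector<std::string> sortByLength(std::vector<std::string> words)
{
  LengthCollator collator;
  std::sort(words.begin(), words.end(),
            CollatorLess<LengthCollator, std::string>(collator));
  return words;
}
```

Given the compareTo() signature documented above, instantiating the adapter as CollatorLess<RWUCollator, RWUString> should work the same way, although that combination is untested here.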

void
enableCaseLevel(bool caseLevel);

Sets whether case distinctions should be made at an extra "case level," positioned between the secondary and tertiary levels:

If self's strength is Primary, base character identity is taken into consideration, then case distinctions are made. Diacritics are not taken into account.
If self's strength is Secondary, base character identity, diacritics, and case distinctions are taken into account, in that order. Other tertiary distinctions, such as those between regular and superscript versions of a character, are not taken into account.
If self's strength is Tertiary, base character identity, diacritics, case distinctions, and other tertiary distinctions are taken into account, in that order.

At the case level, cased characters are ordered according to self's CaseOrder attribute.

void
enableFrenchCollation(bool frenchCollation);

Sets whether French collation rules should be in effect for self.

When French collation rules are in effect, the diacritical differences at the secondary strength level are compared in reverse order, from the end of each string to its start.
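
The effect of reversing the secondary comparison can be seen in a toy model. The sketch below is a simplification under stated assumptions, not the library's implementation: Letter, Word, compareWords, and makeWord are hypothetical, and every accent is reduced to a single weight of 0 (unaccented) or 1 (accented). It compares base letters first, then accents, scanning the accents from the end when French rules are on.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical toy model: a base letter plus an accent weight
// (0 = unaccented, 1 = accented).
struct Letter {
  char base;
  int accent;
};

typedef std::vector<Letter> Word;

// Builds a word from parallel arrays of base letters and accent weights.
Word makeWord(const char* bases, const int* accents, std::size_t length)
{
  Word word;
  for (std::size_t i = 0; i < length; ++i) {
    Letter letter = { bases[i], accents[i] };
    word.push_back(letter);
  }
  return word;
}

// Three-way comparison: base letters front to back, then accents.
// When french is true, accents are scanned from the end of the word.
int compareWords(const Word& lhs, const Word& rhs, bool french)
{
  // Primary level: base letters, always front to back.
  std::size_t n = lhs.size() < rhs.size() ? lhs.size() : rhs.size();
  for (std::size_t i = 0; i < n; ++i) {
    if (lhs[i].base != rhs[i].base) {
      return lhs[i].base < rhs[i].base ? -1 : 1;
    }
  }
  if (lhs.size() != rhs.size()) {
    return lhs.size() < rhs.size() ? -1 : 1;
  }

  // Secondary level: accents. Both words have the same length here.
  assert(lhs.size() == rhs.size());
  for (std::size_t k = 0; k < n; ++k) {
    std::size_t i = french ? n - 1 - k : k;
    if (lhs[i].accent != rhs[i].accent) {
      return lhs[i].accent < rhs[i].accent ? -1 : 1;
    }
  }
  return 0;
}
```

In this model, côte (accent on the second letter) sorts after coté (accent on the last letter) when accents are scanned forward, and before it when they are scanned backward, which matches the conventional French ordering cote, côte, coté, côté.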

void
enableNormalizationChecking(bool check);

Sets whether self should perform normalization checks on input strings.

When normalization checking is disabled, self correctly compares strings that are in FCD (Fast C or D) form--that is, strings whose raw, recursive decomposition (without reordering of diacritics) results in a canonically-ordered string. Most strings in many languages are in FCD form.

In contrast, normalization checking is enabled by default for languages that use multiple combining characters, such as Arabic, Hebrew, Hindi, Thai, and Vietnamese. This ensures that input strings are normalized if necessary before collation. If, however, you know your strings are already in FCD form, you can improve performance slightly by disabling normalization checking.

void
enablePunctuationShifting(bool shift);

Sets whether the significance of punctuation and whitespace characters should be shifted from the primary strength level to the quaternary strength level.
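
A rough sketch of what shifting means, assuming ASCII-only input and a drastically simplified weighting scheme: compareToy and its helpers are hypothetical and do not reproduce the library's collation weights. With shifting off, punctuation takes part in the main comparison. With shifting on, punctuation is skipped, and it only breaks ties when the quaternary level is consulted.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// True for ASCII whitespace and punctuation (the "shiftable" characters).
bool isShiftable(char ch)
{
  unsigned char c = static_cast<unsigned char>(ch);
  return std::isspace(c) != 0 || std::ispunct(c) != 0;
}

// Copy of s with whitespace and punctuation removed.
std::string dropShiftable(const std::string& s)
{
  std::string out;
  for (std::size_t i = 0; i < s.size(); ++i) {
    if (!isShiftable(s[i])) out += s[i];
  }
  return out;
}

// Toy quaternary key: punctuation keeps its own code, every other
// character gets the same high filler weight.
std::string quaternaryKey(const std::string& s)
{
  std::string key;
  for (std::size_t i = 0; i < s.size(); ++i) {
    key += isShiftable(s[i]) ? s[i] : '\x7f';
  }
  return key;
}

// Reduces a std::string::compare() result to -1, 0, or 1.
int sign(int value)
{
  return (value > 0) - (value < 0);
}

// Toy three-way comparison. The quaternary flag means "strength is
// Quaternary or higher" in this simplified model.
int compareToy(const std::string& lhs, const std::string& rhs,
               bool shift, bool quaternary)
{
  if (!shift) {
    return sign(lhs.compare(rhs));
  }
  int result = sign(dropShiftable(lhs).compare(dropShiftable(rhs)));
  if (result == 0 && quaternary) {
    result = sign(quaternaryKey(lhs).compare(quaternaryKey(rhs)));
  }
  return result;
}
```

In this model, "blackbird" and "black-bird" differ when shifting is off, compare equal when shifting is on and the quaternary level is not consulted, and differ again once the quaternary level is included.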

bool
equals(const RWUString& lhs, const RWUString& rhs) const;

Compares the given strings, according to the dictates of this collator's attributes. Returns true if lhs == rhs; otherwise, false.

CaseOrder
getCaseOrder(void) const;

Returns the current CaseOrder for self.

RWUCollationKey
getCollationKey(const RWUString& str) const;

Returns an RWUCollationKey corresponding to the given string str. This key may be compared to other keys produced by collators with the same attributes.

RWULocale
getLocale(void) const;

Returns the locale associated with self.

CollationStrength
getStrength(void) const;

Returns the CollationStrength associated with self.

bool
isEnabledCaseLevel(void) const;

Returns true if the case level is enabled; otherwise, false.

bool
isEnabledFrenchCollation(void) const;

Returns true if French collation rules are in effect; otherwise, false.

bool
isEnabledNormalizationChecking(void) const;

Returns true if normalization checking is enabled; otherwise, false.

bool
isEnabledPunctuationShifting(void) const;

Returns true if punctuation shifting is enabled; otherwise, false.

void 
setCaseOrder(CaseOrder order);

Sets the case ordering for self to order.

void
setStrength(CollationStrength strength);

Sets the collation strength of self to strength.

© Copyright Rogue Wave Software, Inc. All Rights Reserved.
Rogue Wave and SourcePro are registered trademarks of Rogue Wave Software, Inc. in the United States and other countries. All other trademarks are the property of their respective owners.
