RWUNormalizer

Internationalization Module Reference Guide

Module: Internationalization Module
Group: Unicode String Processing


Does Not Inherit

Local Index

Members

CheckResult
NormalizationForm

normalize()
quickCheck()

quickFcdCheck()

Header File

#include <rw/i18n/RWUNormalizer.h>

Description

RWUNormalizer converts a string into a particular normalized form, and detects whether a string is already in a particular form.

Many text strings can be represented by more than one sequence of Unicode characters. This is because the Unicode standard recognizes two types of character equivalence:

Canonical equivalence is a fundamental equivalence between characters and sequences of characters. Correctly rendered, canonical equivalents are indistinguishable. An example of canonical equivalence is between the composite character á (Unicode code point U+00E1) and the canonical decomposition formed by the letter a and a combining acute accent symbol (Unicode code points U+0061 and U+0301). Another example of canonical equivalence is between Korean hangul syllables and the jamo characters that compose them.
Compatibility equivalence is a correspondence between nominal Unicode characters and variants included in Unicode to facilitate round-trip compatibility with other encoding standards. Typically, compatibility characters differ in appearance from their nominal counterparts. For example, the compatibility character ½ (Unicode code point U+00BD) corresponds to the nominal sequence 1/2 (Unicode code points U+0031, U+2044, and U+0032). Another example of compatibility equivalence is between circled and uncircled versions of characters.
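
To make the canonical-equivalence example concrete, the following minimal C++ sketch compares the two encodings of á code point by code point. It does not use the Internationalization Module; the names kComposed, kDecomposed, and sameCodePoints are hypothetical and exist only for illustration.

```cpp
#include <cstddef>

// Composite character: LATIN SMALL LETTER A WITH ACUTE (U+00E1).
constexpr char32_t kComposed[] = { 0x00E1 };

// Canonical decomposition: 'a' (U+0061) followed by
// COMBINING ACUTE ACCENT (U+0301).
constexpr char32_t kDecomposed[] = { 0x0061, 0x0301 };

// Hypothetical helper: true only if both sequences contain exactly
// the same code points in the same order.
template <std::size_t N, std::size_t M>
constexpr bool sameCodePoints(const char32_t (&a)[N],
                              const char32_t (&b)[M])
{
    for (std::size_t i = 0; i < N && i < M; ++i) {
        if (a[i] != b[i]) {
            return false;
        }
    }
    return N == M;
}

// The two sequences are canonically equivalent, yet a naive
// comparison treats them as different strings.
static_assert(!sameCodePoints(kComposed, kDecomposed),
              "canonically equivalent, but not code-point equal");
```

This mismatch is why strings are usually brought into one normalization form before they are compared code point by code point.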

These two types of character equivalence give rise to four normalization forms:

Normalization Form Decomposed (NFD)

Composite characters are replaced by canonical equivalents, in canonical order. Compatibility characters are unaffected.

Normalization Form Compatibility Decomposed (NFKD)

Composite characters are replaced by canonical equivalents, in canonical order. Compatibility characters are replaced by their nominal counterparts.

Normalization Form Composed (NFC)

Character sequences are replaced by canonically equivalent composites, where possible. Compatibility characters are unaffected. The W3C generally recommends that strings be interchanged in NFC.

Normalization Form Compatibility Composed (NFKC)

Character sequences are replaced by canonically equivalent composites, where possible. Compatibility characters are replaced by their nominal counterparts.

Each normalization form produces a unique representation for a given string.

Note that two of the normalization forms, NFD and NFKD, replace composite characters with their canonical decompositions. The other two forms, NFC and NFKC, perform the opposite operation: they replace sequences of characters with canonical composites, where possible.

Also note that two of the normalization forms, NFD and NFC, do not affect compatibility characters. These normalization forms are non-lossy; that is, a string may be converted to NFD or NFC with no loss of information. The other two forms, NFKD and NFKC, replace compatibility characters with their nominal equivalents. As compatibility characters may differ in appearance from their nominal equivalents, information may be lost in converting a string to NFKD or NFKC. In other words, converting to NFKD or NFKC is a lossy operation.
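
The information loss described above can be illustrated with a toy decomposition that knows only the two characters used as examples in this section (á and ½). This is a sketch under stated assumptions, not the algorithm used by normalize(): ToyString, toyEqual, toyDecompose, and the sample constants are hypothetical names, and real normalization also handles canonical ordering and the full Unicode character database.

```cpp
#include <cstddef>

// Fixed-capacity code point buffer so the sketch can be evaluated at
// compile time. Capacity is enough for the samples below only.
struct ToyString {
    char32_t data[8];
    std::size_t size;
};

constexpr bool toyEqual(const ToyString& a, const ToyString& b)
{
    if (a.size != b.size) {
        return false;
    }
    for (std::size_t i = 0; i < a.size; ++i) {
        if (a.data[i] != b.data[i]) {
            return false;
        }
    }
    return true;
}

// Toy decomposition. With compatibility == false it behaves like NFD
// for the known characters; with compatibility == true, like NFKD.
constexpr ToyString toyDecompose(const ToyString& in, bool compatibility)
{
    ToyString out{};
    for (std::size_t i = 0; i < in.size; ++i) {
        const char32_t c = in.data[i];
        if (c == 0x00E1) {                         // canonical: a + U+0301
            out.data[out.size++] = 0x0061;
            out.data[out.size++] = 0x0301;
        } else if (compatibility && c == 0x00BD) { // compatibility: 1 U+2044 2
            out.data[out.size++] = 0x0031;
            out.data[out.size++] = 0x2044;
            out.data[out.size++] = 0x0032;
        } else {
            out.data[out.size++] = c;              // everything else unchanged
        }
    }
    return out;
}

constexpr ToyString kHalf        = { { 0x00BD }, 1 };
constexpr ToyString kOneSlashTwo = { { 0x0031, 0x2044, 0x0032 }, 3 };

// Canonical decomposition keeps the two inputs distinct (non-lossy).
static_assert(!toyEqual(toyDecompose(kHalf, false),
                        toyDecompose(kOneSlashTwo, false)),
              "NFD-style result preserves the distinction");

// Compatibility decomposition maps both inputs to the same output, so
// the original form can no longer be recovered (lossy).
static_assert(toyEqual(toyDecompose(kHalf, true),
                       toyDecompose(kOneSlashTwo, true)),
              "NFKD-style result erases the distinction");
```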

Example

#include <rw/i18n/RWUNormalizer.h>
#include <rw/i18n/RWUConversionContext.h>
#include <iostream>

using std::cout;
using std::endl;

int
main()
{
  // Indicate string literals are encoded according to
  // ISO-8859-1.
  RWUConversionContext context("ISO-8859-1");

  // Use implicit conversion to build a string.
  RWUString str("The French for \"student\" is \"élève.\"");

  // Determine whether the string is in NFD form.
  RWUNormalizer::CheckResult result;
  result = RWUNormalizer::quickCheck(str, RWUNormalizer::FormNFD);

  // If necessary, normalize the string.
  if (result != RWUNormalizer::Yes) {
    cout << "Normalizing to NFD..." << endl;
    str = RWUNormalizer::normalize(str, RWUNormalizer::FormNFD);
  } else {
    cout << "String is already in NFD." << endl;
  } // else

  return 0;
} // main

Results:
========

Normalizing to NFD...

Public Enums

enum NormalizationForm { FormNFD,
                         FormNFKD,
                         FormNFC,
                         FormNFKC
};

A NormalizationForm value indicates a particular normalization form, as defined by the Unicode Standard Annex #15, "Unicode Normalization Forms," http://www.unicode.org/unicode/reports/tr15/.

enum CheckResult { Yes,
                   No,
                   Maybe
};

The quickCheck() and quickFcdCheck() functions return a CheckResult value to indicate whether a string is in a particular form. Yes indicates that the string is in the specified form, No that the string is not in the specified form, and Maybe that the check was inconclusive.
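
Because the result has three values, callers need to decide how to treat Maybe. One conservative policy, sketched below, is to normalize whenever the answer is anything other than Yes. The enum here is a local stand-in that mirrors CheckResult so that the sketch compiles without SourcePro, and needsNormalization is a hypothetical helper, not part of RWUNormalizer.

```cpp
// Local stand-in mirroring RWUNormalizer::CheckResult (assumption:
// used here only so the sketch compiles on its own).
enum CheckResult { Yes, No, Maybe };

// Hypothetical helper: returns true when a full normalization pass is
// warranted. An inconclusive quick check is treated like a negative
// one, because only Yes indicates the string is in the requested form.
constexpr bool needsNormalization(CheckResult result)
{
    return result != Yes;
}

static_assert(!needsNormalization(Yes),  "already normalized");
static_assert(needsNormalization(No),    "must normalize");
static_assert(needsNormalization(Maybe), "inconclusive: normalize anyway");
```

In real code the helper would take RWUNormalizer::CheckResult and be applied to the value returned by quickCheck() before deciding whether to call normalize().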

Static Member Functions

static RWUString
normalize(const RWUString& source,
          NormalizationForm form = FormNFC);

Converts the characters contained in source into the normalization form specified by form, and returns a new string containing the normalized characters. Throws RWUException if there are any errors.

When converting a string to any of the normalization forms, normalize() leaves ASCII characters unaffected, and replaces deprecated characters. normalize() never introduces compatibility characters.

static CheckResult
quickCheck(const RWUString& source, 
           NormalizationForm form = FormNFC);

Determines whether source is in the normalization form specified by form. Returns Yes if source is in form, No if source is not in form, and Maybe if no determination can be made quickly.

Throws RWUException if there are any errors.

static CheckResult
quickFcdCheck(const RWUString& source);

Determines whether source is in Fast C or D (FCD) form. Strictly speaking, FCD is not a normalization form, since it does not specify a unique representation for every string. Instead, it describes a string whose raw decomposition, without character reordering, results in an NFD string. Thus, all NFD, most NFC, and many unnormalized strings are already in FCD form. Such strings may be collated via RWUCollator without further normalization.

Returns Yes if source is FCD, No if source is not FCD, and Maybe if no determination could be made quickly. Throws RWUException if there are any errors.
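
The FCD condition can be illustrated with a toy check over a four-character table. The sketch assumes the usual formulation of FCD: for each character, the trailing combining class of the previous character's decomposition must not exceed the leading combining class of the current character's decomposition, unless the latter is zero. It is not the implementation behind quickFcdCheck(), it never returns an inconclusive result, and ToyCombiningInfo, toyLookup, toyIsFcd, and the sample constants are hypothetical names.

```cpp
#include <cstddef>

// Leading and trailing canonical combining classes of a character's
// canonical decomposition. Toy data for four characters only.
struct ToyCombiningInfo {
    char32_t codePoint;
    unsigned char lead;
    unsigned char trail;
};

constexpr ToyCombiningInfo kToyTable[] = {
    { 0x0061, 0,   0   },  // a
    { 0x00E1, 0,   230 },  // a with acute: decomposes to a + U+0301
    { 0x0301, 230, 230 },  // combining acute accent
    { 0x0323, 220, 220 },  // combining dot below
};

constexpr ToyCombiningInfo toyLookup(char32_t c)
{
    for (const ToyCombiningInfo& entry : kToyTable) {
        if (entry.codePoint == c) {
            return entry;
        }
    }
    return ToyCombiningInfo{ c, 0, 0 };  // unknown: treated as a starter
}

// True if decomposing each character in place, with no reordering,
// would already leave the combining marks in canonical order.
template <std::size_t N>
constexpr bool toyIsFcd(const char32_t (&s)[N])
{
    unsigned char previousTrail = 0;
    for (std::size_t i = 0; i < N; ++i) {
        const ToyCombiningInfo info = toyLookup(s[i]);
        if (info.lead != 0 && previousTrail > info.lead) {
            return false;  // marks would need reordering
        }
        previousTrail = info.trail;
    }
    return true;
}

constexpr char32_t kOrdered[]      = { 0x0061, 0x0323, 0x0301 };
constexpr char32_t kMisordered[]   = { 0x0061, 0x0301, 0x0323 };
constexpr char32_t kComposedOnly[] = { 0x00E1 };
constexpr char32_t kComposedDot[]  = { 0x00E1, 0x0323 };

static_assert(toyIsFcd(kOrdered),      "marks already in canonical order");
static_assert(!toyIsFcd(kMisordered),  "230 before 220 needs reordering");
static_assert(toyIsFcd(kComposedOnly), "composed, not NFD, but still FCD");
static_assert(!toyIsFcd(kComposedDot), "raw decomposition is misordered");
```

The third sample shows a string that is not in NFD yet is in FCD form, in line with the description above.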

© Copyright Rogue Wave Software, Inc. All Rights Reserved.
Rogue Wave and SourcePro are registered trademarks of Rogue Wave Software, Inc. in the United States and other countries. All other trademarks are the property of their respective owners.