SourcePro® API Reference Guide

 

Converts a string into a particular normalized Unicode form, and detects whether a string is already in a particular form.

#include <rw/i18n/RWUNormalizer.h>

Public Types

enum  CheckResult { Yes, No, Maybe }
 
enum  NormalizationForm { FormNFD, FormNFKD, FormNFC, FormNFKC }
 

Static Public Member Functions

static RWUString normalize (const RWUString &source, NormalizationForm form=FormNFC)
 
static CheckResult quickCheck (const RWUString &source, NormalizationForm form=FormNFC)
 
static CheckResult quickFcdCheck (const RWUString &source)
 

Detailed Description

RWUNormalizer converts a string into a particular normalized form, and detects whether a string is already in a particular form.

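Many text strings can be represented by more than one sequence of Unicode characters. This is because the Unicode standard recognizes two types of character equivalence:

Canonical equivalence: two sequences represent the same abstract character and should always appear and behave the same. For example, é can be written as the single precomposed character U+00E9 or as e followed by a combining acute accent (U+0065 U+0301).

Compatibility equivalence: two sequences represent the same abstract character but may differ in appearance or behavior. For example, the ligature ﬁ (U+FB01) is compatibility-equivalent to the two letters f and i.

These two types of character equivalence give rise to four normalization forms:

NFD: canonical decomposition.

NFKD: compatibility decomposition.

NFC: canonical decomposition, followed by canonical composition.

NFKC: compatibility decomposition, followed by canonical composition.

Each normalization form produces a unique representation for a given string.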

Note that two of the normalization forms, NFD and NFKD, replace composite characters with their canonical decompositions. The other two forms, NFC and NFKC, perform the opposite operation: they replace sequences of characters with canonical composites, where possible.
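The two directions can be illustrated with a minimal, self-contained sketch. It does not use RWUNormalizer; the names ToyPair, kToyPairs, toyDecompose() and toyCompose() are hypothetical, and the table covers only two letters. A real normalizer works from the full Unicode Character Database and also reorders combining marks, which this sketch ignores.

```cpp
#include <cstddef>
#include <string>

// Toy data: two precomposed letters and their canonical decompositions.
// Hypothetical table for illustration only.
struct ToyPair { char32_t composed; char32_t base; char32_t mark; };

const ToyPair kToyPairs[] = {
    { U'\u00E9', U'e', U'\u0301' },  // e-acute = e + combining acute accent
    { U'\u00E8', U'e', U'\u0300' },  // e-grave = e + combining grave accent
};

// NFD-like direction: replace each composite with base + combining mark.
std::u32string toyDecompose(const std::u32string& in) {
    std::u32string out;
    for (char32_t c : in) {
        bool replaced = false;
        for (const ToyPair& p : kToyPairs) {
            if (c == p.composed) {
                out += p.base;
                out += p.mark;
                replaced = true;
                break;
            }
        }
        if (!replaced) out += c;
    }
    return out;
}

// NFC-like direction: replace base + combining mark with the composite,
// where the toy table has one. Input is assumed to be already decomposed.
std::u32string toyCompose(const std::u32string& in) {
    std::u32string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        bool replaced = false;
        if (i + 1 < in.size()) {
            for (const ToyPair& p : kToyPairs) {
                if (in[i] == p.base && in[i + 1] == p.mark) {
                    out += p.composed;
                    ++i;  // skip the combining mark that was consumed
                    replaced = true;
                    break;
                }
            }
        }
        if (!replaced) out += in[i];
    }
    return out;
}
```

Applied to the word used in the example below, toyDecompose() turns the five-character precomposed spelling into a seven-character sequence, and toyCompose() turns it back.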

Also note that two of the normalization forms, NFD and NFC, do not affect compatibility characters. These normalization forms are non-lossy; that is, a string may be converted to NFD or NFC with no loss of information. The other two forms, NFKD and NFKC, replace compatibility characters with their nominal equivalents. As compatibility characters may differ in appearance from their nominal equivalents, information may be lost in converting a string to NFKD or NFKC. In other words, converting to NFKD or NFKC is a lossy operation.
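The loss of information can be seen in a small stand-alone sketch. It does not call RWUNormalizer; toyCompatFold() is a hypothetical helper that applies the compatibility mappings of just two characters, which is enough to show that distinct inputs can collapse to the same output.

```cpp
#include <string>

// Toy compatibility mapping for two characters (hypothetical subset of the
// mappings that NFKD and NFKC apply).
std::u32string toyCompatFold(const std::u32string& in) {
    std::u32string out;
    for (char32_t c : in) {
        if (c == U'\uFB01') {
            out += U"fi";   // LATIN SMALL LIGATURE FI -> f, i
        } else if (c == U'\u00B2') {
            out += U'2';    // SUPERSCRIPT TWO -> DIGIT TWO
        } else {
            out += c;
        }
    }
    return out;
}
```

After folding, x followed by SUPERSCRIPT TWO and the plain string x2 are identical, so the original distinction cannot be recovered from the result.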

Example
#include <rw/i18n/RWUNormalizer.h>
#include <rw/i18n/RWUConversionContext.h>
#include <iostream>

using std::cout;
using std::endl;

int main()
{
    // Indicate string literals are encoded according to
    // ISO-8859-1.
    RWUConversionContext context("ISO-8859-1");

    // Use implicit conversion to build a string.
    RWUString str("The French for \"student\" is \"élève.\"");

    // Determine whether the string is in NFD form.
    RWUNormalizer::CheckResult result =
        RWUNormalizer::quickCheck(str, RWUNormalizer::FormNFD);

    // If necessary, normalize the string. Both No and Maybe
    // lead to a call to normalize().
    if (result != RWUNormalizer::Yes) {
        cout << "Normalizing to NFD..." << endl;
        str = RWUNormalizer::normalize(str, RWUNormalizer::FormNFD);
    } else {
        cout << "String is already in NFD." << endl;
    }

    return 0;
}

Program Output:

Normalizing to NFD...

Member Enumeration Documentation

enum RWUNormalizer::CheckResult

The quickCheck() and quickFcdCheck() functions return a CheckResult value to indicate whether a string is in a particular form.

Enumerator
Yes
    Indicates that the string is in the specified form.
No
    Indicates that the string is not in the specified form.
Maybe
    Indicates that the check was inconclusive.
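A three-valued result means callers need a policy for the inconclusive case. The sketch below shows one conservative policy: treat Maybe like No and normalize. It is written against stand-ins so that it compiles without SourcePro; ToyCheck and ensureNormalized() are hypothetical names, and the two callables stand in for RWUNormalizer::quickCheck() and RWUNormalizer::normalize().

```cpp
#include <functional>
#include <string>

// Stand-in for RWUNormalizer::CheckResult (hypothetical, for illustration).
enum class ToyCheck { Yes, No, Maybe };

// Returns text unchanged only when the quick check is conclusive and
// positive; on No or Maybe it falls back to full normalization.
std::string ensureNormalized(
    const std::string& text,
    const std::function<ToyCheck(const std::string&)>& quickCheck,
    const std::function<std::string(const std::string&)>& normalize)
{
    switch (quickCheck(text)) {
        case ToyCheck::Yes:
            return text;
        case ToyCheck::No:
        case ToyCheck::Maybe:
            return normalize(text);
    }
    return normalize(text);  // not reached for valid enumerator values
}
```

Normalizing on Maybe is always safe, because normalizing a string that is already in the target form leaves it unchanged; the only cost is the extra work.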

enum RWUNormalizer::NormalizationForm

A NormalizationForm value indicates a particular normalization form, as defined by Unicode Standard Annex #15, "Unicode Normalization Forms," http://www.unicode.org/reports/tr15/.

Enumerator
FormNFD
    Canonical decomposition.
FormNFKD
    Compatibility decomposition.
FormNFC
    Canonical decomposition, followed by canonical composition.
FormNFKC
    Compatibility decomposition, followed by canonical composition.

Member Function Documentation

static RWUString RWUNormalizer::normalize (const RWUString &source, NormalizationForm form=FormNFC)

Converts the characters contained in source into the normalization form specified by form, and returns a new string containing the normalized characters.

When converting a string to any of the normalization forms, normalize() leaves US-ASCII characters unaffected, and replaces deprecated characters. normalize() never introduces compatibility characters.
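Because US-ASCII characters are unaffected by every normalization form, a string made up only of US-ASCII is already normalized. A caller could use that fact as a cheap pre-check, as in the sketch below. This is an illustration only: isAllAscii() is a hypothetical helper, the text is assumed to be held as UTF-16 code units, and whether RWUNormalizer applies a similar shortcut internally is not stated in this reference.

```cpp
#include <algorithm>
#include <string>

// True when every UTF-16 code unit is a US-ASCII character (below 0x80).
// Such a string reads the same in NFD, NFKD, NFC and NFKC.
bool isAllAscii(const std::u16string& s) {
    return std::all_of(s.begin(), s.end(),
                       [](char16_t unit) { return unit < 0x80; });
}
```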

Exceptions
RWUException
    Thrown if there are any errors.

static CheckResult RWUNormalizer::quickCheck (const RWUString &source, NormalizationForm form=FormNFC)

Determines whether source is in the normalization form specified by form. Returns Yes if source is in form, No if source is not in form, and Maybe if no determination can be made quickly.

Exceptions
RWUException
    Thrown if there are any errors.

static CheckResult RWUNormalizer::quickFcdCheck (const RWUString &source)

Determines whether source is in Fast C or D (FCD) form. Strictly speaking, FCD is not a normalization form, since it does not specify a unique representation for every string. Instead, it describes a string whose raw decomposition, without character reordering, results in an NFD string. Thus, all NFD, most NFC, and many unnormalized strings are already in FCD form. Such strings may be collated via RWUCollator without further normalization.

Returns Yes if source is FCD, No if source is not FCD, and Maybe if no determination could be made quickly.

Exceptions
RWUException
    Thrown if there are any errors.
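The FCD condition can be made concrete with a toy check. For each character, take the canonical combining class of the first and last characters of its canonical decomposition (the lead and trail classes). A string is FCD when no character has a nonzero lead class that is lower than the trail class of the character before it. The sketch below is not the RWUNormalizer implementation: ToyFcd, toyFcdClasses() and toyIsFcd() are hypothetical, and the table lists only four characters, treating everything else as a starter with class 0.

```cpp
#include <string>

// Lead and trail canonical combining classes of a character's canonical
// decomposition. Toy table; unlisted characters are treated as (0, 0).
struct ToyFcd { unsigned lead; unsigned trail; };

ToyFcd toyFcdClasses(char32_t c) {
    switch (c) {
        case U'\u0300': return {230, 230};  // combining grave accent
        case U'\u0301': return {230, 230};  // combining acute accent
        case U'\u0323': return {220, 220};  // combining dot below
        case U'\u00E9': return {0, 230};    // decomposes to e + U+0301
        default:        return {0, 0};
    }
}

// True when raw decomposition would need no reordering of combining marks.
bool toyIsFcd(const std::u32string& s) {
    unsigned prevTrail = 0;
    for (char32_t c : s) {
        const ToyFcd cc = toyFcdClasses(c);
        if (cc.lead != 0 && cc.lead < prevTrail) {
            return false;  // a mark would have to move before an earlier one
        }
        prevTrail = cc.trail;
    }
    return true;
}
```

Under this toy table, a precomposed é on its own is FCD even though it is not NFD, while é followed by a combining dot below is not FCD, because decomposing it would leave the dot below (class 220) after the acute accent (class 230).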

Copyright © 2023 Rogue Wave Software, Inc., a Perforce company. All Rights Reserved.