Character Properties

Internationalization Module User's Guide

3.3 Character Properties

One of the strengths of the Unicode Standard is that it not only defines a very large character set, but also defines a comprehensive set of properties for each code point in the Unicode character set. The set of properties and the values of those properties are specified by the Unicode Character Database that is published as part of the Unicode Standard:

http://www.unicode.org/ucd

The Unicode Character Database consists of a number of data files. The latest versions of all data files are available here:

http://www.unicode.org/Public/UNIDATA/

Unicode character properties may be either normative or informative, as defined in Chapter 3, "Conformance," of the Unicode Standard:

A normative property is required for conformance with the Unicode Standard. Implementations that claim conformance to the Unicode Standard and that make use of a particular normative property must follow the specifications of the standard for that property to be conformant.
An informative property is strongly recommended, but a conformant implementation is free to use or change such values as it may require, while still remaining conformant to the standard.

In the Internationalization Module, RWUCharTraits provides access to Unicode character properties. This class defines several public enums that name property values in plain English, and a series of static methods for querying the properties of a character. For example, the static method RWUCharTraits::getScript() returns an enumerated value identifying the script property of the Unicode character with the given code point, such as Latin, Greek, Hebrew, Arabic, or Han. Most methods on RWUCharTraits take RWUChar32 code points as arguments; a few operate on RWUChar16 code units. RWUCharTraits provides access to both normative and informative properties of characters.

It is not necessary to instantiate class RWUCharTraits. All its methods are static.

3.3.1 Valid Code Points

The range of Unicode code points is 0x0 to 0x10FFFF. However, some values within this range are reserved and are not valid characters. RWUCharTraits provides the static method RWUCharTraits::isCharacter(), which returns true if a given RWUChar32 value is a valid Unicode character code point.

RWUCharTraits::isDefined() returns true if a given RWUChar32 value is defined as the code point for a named character in the Unicode Character Database. A defined character is assigned various properties under the Unicode Standard. These properties can be accessed using other methods provided by RWUCharTraits, as described in the following sections.

In short, RWUCharTraits::isCharacter() tests whether a code point is valid, and hence may be a defined character, while RWUCharTraits::isDefined() tests whether a code point has already been defined.
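The idea of a valid code point can be illustrated with a standalone sketch. It does not use the Internationalization Module. It assumes that the reserved values are the surrogate code points and the Unicode noncharacters; whether RWUCharTraits::isCharacter() applies exactly this rule is not stated in this guide. The names CodePoint, isSurrogateCodePoint, isNoncharacter, and isValidCharacter are hypothetical, and the code requires C++14 or later.

```cpp
#include <cstdint>

// Hypothetical stand-in for the library's RWUChar32 type.
using CodePoint = std::uint32_t;

// True for the surrogate code points U+D800..U+DFFF.
constexpr bool isSurrogateCodePoint(CodePoint c) {
    return c >= 0xD800 && c <= 0xDFFF;
}

// True for U+FDD0..U+FDEF and for any value whose low 16 bits are
// FFFE or FFFF. For values within 0x0..0x10FFFF these are exactly the
// Unicode noncharacters.
constexpr bool isNoncharacter(CodePoint c) {
    return (c >= 0xFDD0 && c <= 0xFDEF) || ((c & 0xFFFE) == 0xFFFE);
}

// Assumed validity rule: in range, not a surrogate, not a noncharacter.
constexpr bool isValidCharacter(CodePoint c) {
    return c <= 0x10FFFF && !isSurrogateCodePoint(c) && !isNoncharacter(c);
}
```

Under these assumptions, isValidCharacter(0x41) is true, while 0xD800, 0xFFFE, and 0x110000 are all rejected. A check for defined characters cannot be sketched this way, because it depends on the contents of the Unicode Character Database.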

3.3.2 Surrogate Pairs

In UTF-16, most Unicode characters can be represented with a single 16-bit code unit. Only characters in the range 0x10000 to 0x10FFFF must be represented with a surrogate pair of two UTF-16 code units. RWUCharTraits provides the static method RWUCharTraits::requiresSurrogatePair(), which returns true if a given RWUChar32 code point requires a surrogate representation.

Similarly, RWUCharTraits::isHighSurrogate() returns true if a given RWUChar16 code unit is the first, or high, code unit of a surrogate pair. A high surrogate has a value in the range U+D800 to U+DBFF. The function RWUCharTraits::isLowSurrogate() returns true if a given RWUChar16 code unit is the second, or low, code unit of a surrogate pair. A low surrogate has a value in the range U+DC00 to U+DFFF. The method RWUCharTraits::isSurrogate() returns true if a given RWUChar16 is a surrogate in the range U+D800 to U+DFFF. Surrogates are not characters themselves; they are reserved for use as the low or high code unit in a surrogate pair.

Finally, RWUCharTraits::isSingle() returns true if a given RWUChar16 code unit corresponds to a single code point, or false if the value is part of a surrogate pair.
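The ranges above follow from the UTF-16 encoding form, which can be sketched without the library. The type aliases and function names below are hypothetical stand-ins and are not part of RWUCharTraits; the arithmetic is the standard UTF-16 mapping between a supplementary code point and its surrogate pair.

```cpp
#include <cstdint>

using CodeUnit = std::uint16_t;   // hypothetical stand-in for RWUChar16
using CodePoint = std::uint32_t;  // hypothetical stand-in for RWUChar32

// Code points above the 16-bit range need two UTF-16 code units.
constexpr bool needsSurrogatePair(CodePoint c) {
    return c >= 0x10000 && c <= 0x10FFFF;
}

constexpr bool isHigh(CodeUnit u) { return u >= 0xD800 && u <= 0xDBFF; }
constexpr bool isLow(CodeUnit u) { return u >= 0xDC00 && u <= 0xDFFF; }

// A code unit that is not a surrogate stands for a code point by itself.
constexpr bool isSingleUnit(CodeUnit u) { return !isHigh(u) && !isLow(u); }

// First (high) code unit: the top 10 bits of the 20-bit offset.
constexpr CodeUnit highSurrogate(CodePoint c) {
    return static_cast<CodeUnit>(0xD800 + ((c - 0x10000) >> 10));
}

// Second (low) code unit: the bottom 10 bits of the 20-bit offset.
constexpr CodeUnit lowSurrogate(CodePoint c) {
    return static_cast<CodeUnit>(0xDC00 + ((c - 0x10000) & 0x3FF));
}

// Inverse mapping: rebuild the code point from a surrogate pair.
constexpr CodePoint combine(CodeUnit high, CodeUnit low) {
    return 0x10000 + ((static_cast<CodePoint>(high) - 0xD800) << 10)
                   + (static_cast<CodePoint>(low) - 0xDC00);
}
```

For example, U+1F600 maps to the pair 0xD83D, 0xDE00, and combine(0xD83D, 0xDE00) yields 0x1F600 again. Both helper functions assume that their argument satisfies needsSurrogatePair().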

3.3.3 Character Blocks

A character block is a grouping of related characters within the Unicode encoding space. RWUCharTraits provides a Block enum with values that identify the various blocks, such as the BasicLatinBlock, the GreekAndCopticBlock, the BengaliBlock, the ThaiBlock, the EthiopicBlock, the CherokeeBlock, and so on. (See the documentation for RWUCharTraits in the SourcePro C++ API Reference Guide for a complete list of enumerated values.) The values in this enumeration correspond to the block names that appear in the Unicode Character Database, as described in Chapter 14, "Code Charts," of the Unicode Standard.

The static method RWUCharTraits::getBlock() returns the value in the Block enumeration that identifies the character block containing the Unicode character with a given code point.
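Because blocks are contiguous code point ranges, a block lookup amounts to a range search. The following standalone sketch illustrates the idea for six blocks only. The Block enumeration, the BlockRange struct, and findBlock() are hypothetical and are not the library's types; the ranges are those published in the Unicode Character Database. The code requires C++14 or later.

```cpp
#include <cstdint>

// Hypothetical enumeration covering only the blocks in the table below.
enum class Block { BasicLatin, GreekAndCoptic, Bengali, Thai, Ethiopic, Cherokee, Unknown };

struct BlockRange {
    std::uint32_t first;  // first code point of the block
    std::uint32_t last;   // last code point of the block
    Block block;
};

// A small excerpt of the block ranges, sorted by code point.
constexpr BlockRange kBlocks[] = {
    {0x0000, 0x007F, Block::BasicLatin},
    {0x0370, 0x03FF, Block::GreekAndCoptic},
    {0x0980, 0x09FF, Block::Bengali},
    {0x0E00, 0x0E7F, Block::Thai},
    {0x1200, 0x137F, Block::Ethiopic},
    {0x13A0, 0x13FF, Block::Cherokee},
};

// Linear search; returns Block::Unknown for anything outside the excerpt.
constexpr Block findBlock(std::uint32_t c) {
    for (const BlockRange& r : kBlocks) {
        if (c >= r.first && c <= r.last) return r.block;
    }
    return Block::Unknown;
}
```

With this excerpt, findBlock(0x03B1) (GREEK SMALL LETTER ALPHA) yields Block::GreekAndCoptic, while a Cyrillic code point such as 0x0400 yields Block::Unknown only because the sketch omits that block.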

3.3.4 Character Scripts

Every Unicode character is assigned a script name in the Unicode Character Database. The script name associated with a code point is often a better basis for distinguishing characters than the block name. Blocks are simply code point ranges; characters from the same script may be in several different blocks, while characters from different scripts may be in the same block.

RWUCharTraits provides a Script enum with values that identify the various scripts, such as Latin, Cyrillic, Hebrew, Tibetan, Runic, and so on. (See the documentation for RWUCharTraits in the SourcePro C++ API Reference Guide for a complete list of enumerated values.) The values in this enumeration correspond to the script property names defined in the Unicode Character Database, as described in Unicode Technical Report #24, "Script Names":

http://www.unicode.org/unicode/reports/tr24

The static method RWUCharTraits::getScript() returns the value in the Script enumeration that identifies the script associated with a given code point.

3.3.5 General Character Categories

Every Unicode character is also assigned to a general character category in the Unicode Character Database. RWUCharTraits provides a GeneralCategory enum with values that identify the various categories, such as UppercaseLetter, LowercaseLetter, DecimalDigitNumber, LineSeparator, ConnectorPunctuation, and so on. (See the documentation for RWUCharTraits in the SourcePro C++ API Reference Guide for a complete list of enumerated values.) The values in this enumeration correspond to the general category property codes that appear in the Unicode Character Database, as described in:

http://www.unicode.org/reports/tr44/

The static method RWUCharTraits::getGeneralCategory() returns the value in the GeneralCategory enumeration that identifies the general character category associated with a given code point. Various convenience methods are also provided, which return true if a given RWUChar32 represents a code point in a particular character category: RWUCharTraits::isControl(), RWUCharTraits::isError(), RWUCharTraits::isLetter(), RWUCharTraits::isPunctuation(), RWUCharTraits::isSpace(), and RWUCharTraits::isWhitespace(). The static method getWhitespace() returns a null-terminated array of whitespace code points, as a convenience for use as delimiters (see Section 7.3).

3.3.6 Character Names

Each Unicode character may have two different names: a deprecated name, as defined by Unicode 1.0, and a standard name, as defined in subsequent versions of the standard. These names are defined in the Unicode Character Database. For example, the standard name for the Unicode space character (U+0020) is SPACE.

RWUCharTraits provides the static method RWUCharTraits::getName(), which returns the name of the character represented by a given code point as an RWCString. An optional second argument, getDeprecatedName, indicates whether the method should return the deprecated name for the character. The default value is false.

Conversely, RWUCharTraits::getChar32() returns the code point for the Unicode character with a given name. An optional second argument, isDeprecatedName, indicates whether the given name is the deprecated name for the character. The default value is false.

3.3.7 Character Directionality

All Unicode characters are assigned a directionality type in the Unicode Character Database. RWUCharTraits provides a BidirectionalCategory enum with values that identify the various directionality types, such as LeftToRight, RightToLeft, RightToLeftArabic, LeftToRightEmbedding, RightToLeftOverride, and so on. (See the documentation for RWUCharTraits in the SourcePro C++ API Reference Guide for a complete list of enumerated values.) The values in this enumeration correspond to the bidirectional category property codes defined in the Unicode Character Database, as described in Unicode Standard Annex #9, "The Bidirectional Algorithm":

http://www.unicode.org/unicode/reports/tr9

The static method RWUCharTraits::getBidirectionalCategory() returns the value in the BidirectionalCategory enumeration identifying the directionality type associated with a given code point.

3.3.8 Character Width

RWUCharTraits provides an EastAsianWidth enum with values that identify the various widths: ZeroWidth, HalfWidth, FullWidth, and NeutralWidth. The values in this enumeration correspond to the East Asian width property values defined in the Unicode Character Database, as described in Unicode Standard Annex #11, "East Asian Width":

http://www.unicode.org/unicode/reports/tr11/

The static method RWUCharTraits::getEastAsianWidth() returns the value in the EastAsianWidth enumeration that identifies the default width associated with a given code point.

3.3.9 Combining Classes

Combining characters combine graphically with a preceding character. They include diacritics, Hebrew points, and Arabic vowel signs. Each Unicode character is assigned to a combining class in the Unicode Character Database.

RWUCharTraits provides a CombiningClass enum with values that identify the combining class assigned to the character, such as BaseEquivalent, HebrewPointHatafQamats, ArabicFathatan, ThaiCharacterMaiTri, TibetanVowelSignAa, and so on. (See the documentation for RWUCharTraits in the SourcePro C++ API Reference Guide for a complete list of enumerated values.) The values in this enumeration correspond to the combining classes defined in the Unicode Character Database, as described in Section 2, "Combining Classes," of Chapter 4, "Character Properties," of the Unicode Standard, http://unicode.org/versions/Unicode5.2.0/ch04.pdf.

The static method RWUCharTraits::getCombiningClass() returns the value in the CombiningClass enumeration that identifies the combining class associated with a given code point.

3.3.10 Character Case

Case is a property of some alphabets in which different characters are considered to be variants of the same letter. Unicode case mappings are described in Section 5.18 of the Unicode Standard:

http://www.unicode.org/versions/Unicode5.2.0/ch05.pdf

RWUCharTraits provides the static methods RWUCharTraits::isLower(), RWUCharTraits::isTitle(), and RWUCharTraits::isUpper(). These methods return true if a given RWUChar32 represents a code point for a lowercase, titlecase, or uppercase letter, respectively.

Static methods RWUCharTraits::toLower(), RWUCharTraits::toTitle(), and RWUCharTraits::toUpper() are also provided, which perform simple case conversions. An equivalent letter in the desired case is returned if such a mapping exists in the character database. If the letter has no equivalent in the desired case, the character itself is returned.
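The fallback behavior of a simple case conversion (return the mapped letter if one exists, otherwise return the character itself) can be sketched as follows. This is a standalone illustration, not the library's implementation. The function name toUpperSimple is hypothetical, and the sketch covers only ASCII and the Latin-1 lowercase letters that lie exactly 0x20 above their uppercase forms; every other character, including letters that do have a mapping in the character database, is returned unchanged.

```cpp
#include <cstdint>

// Simple uppercase conversion for two ranges only.
constexpr std::uint32_t toUpperSimple(std::uint32_t c) {
    // a..z map to A..Z.
    if (c >= 0x61 && c <= 0x7A) return c - 0x20;
    // U+00E0..U+00FE map to U+00C0..U+00DE, except U+00F7 DIVISION SIGN,
    // which is not a letter.
    if (c >= 0xE0 && c <= 0xFE && c != 0xF7) return c - 0x20;
    // No mapping known to this sketch: return the character itself.
    return c;
}
```

For example, U+00E9 (é) maps to U+00C9 (É), while U+00DF (ß), which has no single-character uppercase equivalent in the simple case mappings, comes back unchanged.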

3.3.11 Character Mirroring

Mirroring is a property of characters, such as parentheses, whose images are reflected horizontally in text that is laid out right to left. For example, the left parenthesis is the opening parenthesis in left-to-right text, but in right-to-left text the mirrored right parenthesis is the opening parenthesis. The Unicode mirrored property is described in Unicode Standard Annex #9, "The Bidirectional Algorithm":

http://www.unicode.org/unicode/reports/tr9/

RWUCharTraits provides the static method RWUCharTraits::isMirrored(), which returns true if a given RWUChar32 is a code point for a mirrored character. For such a character, RWUCharTraits::getMirror() returns the code point of the character that provides a "mirror-like" image of it; for any other character, RWUCharTraits::getMirror() returns the given code point.
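The mirror lookup described here can be pictured as a table of character pairs with a pass-through default. The sketch below is a standalone illustration under that assumption. MirrorPair, kMirrors, and mirrorOf() are hypothetical names, and the table holds only five of the mirrored pairs listed in the Unicode Character Database. The code requires C++14 or later.

```cpp
#include <cstdint>

struct MirrorPair {
    std::uint32_t a;
    std::uint32_t b;
};

// A small excerpt of mirrored pairs.
constexpr MirrorPair kMirrors[] = {
    {0x0028, 0x0029},  // ( )
    {0x003C, 0x003E},  // < >
    {0x005B, 0x005D},  // [ ]
    {0x007B, 0x007D},  // { }
    {0x00AB, 0x00BB},  // « »
};

// Returns the partner of a character in the table, or the character
// itself when the table has no entry for it.
constexpr std::uint32_t mirrorOf(std::uint32_t c) {
    for (const MirrorPair& p : kMirrors) {
        if (c == p.a) return p.b;
        if (c == p.b) return p.a;
    }
    return c;
}
```

Here mirrorOf(0x0028) yields 0x0029, and a letter such as 0x0041 is returned unchanged. Applying mirrorOf() twice always returns the original code point.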

3.3.12 Numeric Values

Numeric characters are characters that represent numbers. This group includes Roman numerals, superscripts, subscripts, fractions, and so on. Numeric characters are identified in the Unicode Character Database according to the method described in:

http://www.unicode.org/reports/tr44/#Derived_Extracted

RWUCharTraits provides the static method RWUCharTraits::isNumeric(), which returns true if a given RWUChar32 represents a code point for a numeric value. If so, RWUCharTraits::getNumericValue() returns the numeric value of the character as an int32_t.

A subset of the set of numeric characters is the set of decimal digits. This set consists of the digits 0-9, plus various display variants of these digits: circled digits, digits followed by a full stop, and so on. The static method RWUCharTraits::isDecimalDigit() returns true if a given RWUChar32 represents a code point for a decimal digit. If so, RWUCharTraits::getDecimalValue() returns the decimal value of a character as an int32_t.

Finally, the static method RWUCharTraits::isDigit() returns true if a given RWUChar32 represents a code point for a digit 0-9, without including display variants.
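As a rough illustration of how a decimal value can be derived from a code point, the following standalone sketch relies on the fact that each of the digit sets listed is encoded as a contiguous run of ten code points starting at its zero digit. It is not the library's implementation. The names kZeroDigits and decimalValue are hypothetical, the table covers only five digit sets and none of the display variants mentioned above, and returning -1 for a non-digit is an assumption of the sketch, not documented behavior of RWUCharTraits::getDecimalValue(). The code requires C++14 or later.

```cpp
#include <cstdint>

// Zero digits of a few decimal digit sets; digits 1..9 follow each one.
constexpr std::uint32_t kZeroDigits[] = {
    0x0030,  // DIGIT ZERO
    0x0660,  // ARABIC-INDIC DIGIT ZERO
    0x06F0,  // EXTENDED ARABIC-INDIC DIGIT ZERO
    0x0966,  // DEVANAGARI DIGIT ZERO
    0xFF10,  // FULLWIDTH DIGIT ZERO
};

// Returns 0..9 for a digit in one of the listed sets, or -1 otherwise
// (the -1 convention is an assumption of this sketch).
constexpr std::int32_t decimalValue(std::uint32_t c) {
    for (std::uint32_t zero : kZeroDigits) {
        if (c >= zero && c <= zero + 9) {
            return static_cast<std::int32_t>(c - zero);
        }
    }
    return -1;
}
```

For example, decimalValue(0x0037) and decimalValue(0x0667) both yield 7, since U+0037 is DIGIT SEVEN and U+0667 is ARABIC-INDIC DIGIT SEVEN.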

The Rogue Wave name and logo, and SourcePro, are registered trademarks of Rogue Wave Software. All other trademarks are the property of their respective owners.