Appendix A Glossary

An atomic unit of textual communication; for example, the letter Q.

An acronym for the American National Standards Institute. When referring to character sets, ANSI is often used as the collective name for all Microsoft Windows code pages, and sometimes is used to specify code page 1252, which is a superset of ISO/IEC 8859-1.

An acronym for American Standard Code for Information Interchange, a 7-bit code that is the U.S. national variant of ISO/IEC 646. Formally, the U.S. standard ANSI X3.4.

A character that does not graphically combine with preceding characters and that is neither a control nor a format character.

The set of characters used to compose a C++ source program, as defined by the C++ Standard.

The fundamental equivalence between individual Unicode characters and sequences of Unicode characters. Appropriately rendered, canonical equivalents are indistinguishable.

See abstract character.

A synonym for coded character set.

A mapping from a coded character set to the actual code units used to represent the data.

A character encoding form plus byte serialization. There are seven character encoding schemes in Unicode: UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, and UTF-32LE.

Different Unicode code points or sequences of code points considered equivalent forms of the same information. See also equivalence.

A complete mapping from abstract characters to code points to code units to bytes. A character map thus implicitly includes a coded character set, a character encoding form, and a character encoding scheme. The Internet Assigned Numbers Authority (IANA) charset identifiers refer to character maps.

A set of property names and property values associated with individual characters in the Unicode Character Database.

An unordered collection of abstract characters used to represent textual information. Also called an abstract character repertoire.

A synonym for coded character set.

A mapping of code units into byte sequences. Whereas a character encoding form maps code points to code units, a character encoding scheme maps code units to bytes.

A mapping from a set of abstract characters to a set of non-negative integers.

A coded character set. Usually refers to the coded character set used by a personal computer; for example, PC code page 437, the default coded character set used by the U.S. English version of the DOS operating system.

The integer associated with an abstract character in a coded character set.

The fundamental binary width in a computer architecture used for representing character data, such as 7 bits, 8 bits, 16 bits, or 32 bits. Depending on the character encoding form used, each code point in a coded character set may be represented internally by one or more code units.
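As a rough illustration of how one code point can occupy one or more code units, the following C++ sketch reports the number of code units needed in each Unicode encoding form. The function names are hypothetical, chosen for this example only, and the sketch assumes the input is a valid code point; it does not reject surrogate values.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Number of 8-bit code units UTF-8 needs for one code point.
inline std::size_t utf8_code_units(std::uint32_t cp) {
    assert(cp <= 0x10FFFF);  // precondition: within the Unicode code space
    if (cp < 0x80) return 1;
    if (cp < 0x800) return 2;
    if (cp < 0x10000) return 3;
    return 4;
}

// Number of 16-bit code units UTF-16 needs for one code point.
inline std::size_t utf16_code_units(std::uint32_t cp) {
    assert(cp <= 0x10FFFF);
    return cp < 0x10000 ? 1 : 2;
}

// UTF-32 always uses a single 32-bit code unit.
inline std::size_t utf32_code_units(std::uint32_t cp) {
    assert(cp <= 0x10FFFF);
    return 1;
}
```

For example, the code point U+20AC takes three code units in UTF-8 but only one in UTF-16 and UTF-32.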

The process of sorting and comparing text strings.

A character that graphically combines with a preceding base character; for example, a diacritic mark.

A Unicode character provided for round-trip compatibility with some preexisting character encoding standard.

The equivalence between compatibility characters and existing nominal characters. For example, the compatibility character ½ (U+00BD) corresponds to the nominal sequence 1⁄2 (U+0031, U+2044, and U+0032).

See decomposable character.

The process of mapping characters from one character encoding to another. Note that conversion does not change the characters themselves; it merely changes the numbers used to represent those characters within the computer.
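A minimal sketch of such a conversion, assuming ISO/IEC 8859-1 input and UTF-8 output: the characters stay the same and only the byte values change. The function name latin1_to_utf8 is hypothetical. The sketch relies on the fact that every ISO/IEC 8859-1 byte value equals the corresponding Unicode code point (U+0000 to U+00FF).

```cpp
#include <cassert>
#include <string>

// Convert ISO/IEC 8859-1 bytes to UTF-8 (hypothetical helper).
inline std::string latin1_to_utf8(const std::string& latin1) {
    std::string out;
    for (unsigned char c : latin1) {
        if (c < 0x80) {
            // U+0000..U+007F: one byte, unchanged.
            out += static_cast<char>(c);
        } else {
            // U+0080..U+00FF: two bytes, 110xxxxx 10xxxxxx.
            out += static_cast<char>(0xC0 | (c >> 6));
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
    }
    assert(out.size() >= latin1.size());  // UTF-8 output is never shorter
    return out;
}
```

Here the character é is the single byte 0xE9 before conversion and the two bytes 0xC3 0xA9 afterwards, yet it is the same abstract character in both encodings.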

A context that specifies the converter to use for implicit conversions.

An object that converts text from one encoding to another. See conversion.

Abbreviation for double-byte character set.

A character that is equivalent to a base character followed by one or more combining characters. Also known as a composite character.

The process of separating or analyzing a text element into component units.

A mark applied or attached to a symbol, such as an accent mark.

A character set encoded with two bytes per character. This term is generally used in contrast with SBCS and/or MBCS. Abbreviated as DBCS.

Abbreviation for Extended Binary-Coded Decimal Interchange Code. A group of coded character sets used on mainframes that consist of 8-bit encodings.

The association of an abstract character with a numeric code point in a character encoding.

In the context of text processing, the process or result of establishing whether two text elements are identical in some respect.

The character set used when a C++ application is executing. May be distinct from the basic source character set. Also called machine character set.

A conversion performed by an explicitly specified converter. See also implicit conversion.

A character encoding form whose code unit sequences are all of the same length. For example, single-byte character sets (SBCS) are fixed width. If a double-byte character set (DBCS) always uses two code units to represent a code point, then it is also fixed width.

A particular image that represents an abstract character or part of a character. Many glyphs may be used to denote the same character; for example, the letter Q rendered in several different typefaces. Although recognizably different, these glyphs all represent the same abstract character Q.

An acronym for Internet Assigned Numbers Authority.

An acronym for International Components for Unicode. ICU is a set of open source libraries written in C and C++, originally developed by IBM and now maintained under the Unicode Consortium.

A conversion performed by the converter specified in the current conversion context.

A character property that is not a normative property, but that contributes to the correct use and implementation of the Unicode Standard. Informative properties may differ between conforming implementations of the Unicode Standard.

The process of creating an application that can support a variety of languages and related cultural conventions.

A set of conventions determined by human language and customs, as defined within a particular user community. These conventions include a particular written language, sorting orders, and formats for dates and numbers.

The process of configuring an application to support a particular language and related cultural conventions.

See execution character set.

Abbreviation for multibyte character set.

A character set encoded with a variable number of bytes per character. Abbreviated as MBCS. See also DBCS and SBCS.

The process of converting Unicode text to a particular unique representation.

A scheme for uniquely representing text in Unicode, as defined by Unicode Standard Annex #15, Unicode Normalization Forms: http://www.unicode.org/unicode/reports/tr15/

A property that is required for conformance with the Unicode Standard. Normative properties may not differ between conforming implementations of the Unicode Standard. See also informative property.

A sequence of characters that represents a pattern.

A collection of resources associated with a given locale.

An abbreviation for single-byte character set.

A character set encoded with one byte per character. This term is generally used in contrast with DBCS or MBCS. Abbreviated as SBCS.

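A sequence of two 16-bit code units used in UTF-16 to represent a single code point in the range 0x10000 to 0x10FFFF. The first code unit (the high surrogate) lies in the range 0xD800 to 0xDBFF and the second (the low surrogate) in the range 0xDC00 to 0xDFFF.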
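The arithmetic behind a surrogate pair can be sketched in C++ as follows. The function names are hypothetical; the constants 0xD800, 0xDC00, and 0x10000 come from the UTF-16 definition in the Unicode Standard.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Split a code point in 0x10000..0x10FFFF into (high, low) surrogates.
inline std::pair<char16_t, char16_t> to_surrogate_pair(std::uint32_t cp) {
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    const std::uint32_t v = cp - 0x10000;  // 20-bit offset
    const char16_t high = static_cast<char16_t>(0xD800 + (v >> 10));    // top 10 bits
    const char16_t low  = static_cast<char16_t>(0xDC00 + (v & 0x3FF));  // low 10 bits
    return std::make_pair(high, low);
}

// Recombine a (high, low) surrogate pair into the original code point.
inline std::uint32_t from_surrogate_pair(char16_t high, char16_t low) {
    assert(high >= 0xD800 && high <= 0xDBFF);
    assert(low >= 0xDC00 && low <= 0xDFFF);
    return 0x10000
         + ((static_cast<std::uint32_t>(high) - 0xD800) << 10)
         + (static_cast<std::uint32_t>(low) - 0xDC00);
}
```

For example, U+1F600 splits into the code units 0xD83D and 0xDE00, and recombining them yields 0x1F600 again.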

The Universal Character Set coded in 2 octets, as specified by the ISO/IEC 10646 standard. UCS-2 is a two-byte fixed width subset of UTF-16. See Appendix C, “Relationship to ISO/IEC 10646,” in the Unicode Standard.

The Universal Character Set coded in 4 octets, as specified by the ISO/IEC 10646 standard. UCS-4 is equivalent to UTF-32. See Appendix C, “Relationship to ISO/IEC 10646,” in the Unicode Standard.

A standard for representing on a computer virtually all of the characters of most scripts of the world.

A collection of files providing Unicode character properties and mappings: http://www.unicode.org/ucd/

An escape sequence of the form \uXXXX or \UXXXXXXXX, where each X is a hexadecimal digit and the digits together specify a code point in the ISO/IEC 10646 and Unicode coded character sets.
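A short C++11 sketch of such escape sequences in character and string literals. It uses the four-digit form and also the eight-digit form \UXXXXXXXX, which C++ provides for code points above 0xFFFF. The variable names are hypothetical.

```cpp
// Four hex digits: a code point up to U+FFFF.
constexpr char16_t e_acute = u'\u00E9';       // LATIN SMALL LETTER E WITH ACUTE

// Eight hex digits: any code point, here one above U+FFFF.
constexpr char32_t grinning = U'\U0001F600';  // GRINNING FACE

// In a UTF-16 string literal the same code point becomes a surrogate pair,
// so the array holds two code units plus the terminating null.
constexpr char16_t pair_literal[] = u"\U0001F600";

static_assert(e_acute == 0x00E9, "value is the code point itself");
static_assert(grinning == 0x1F600, "value is the code point itself");
static_assert(sizeof(pair_literal) / sizeof(char16_t) == 3,
              "two code units plus terminator");
```

Because the escape names the code point directly, the result does not depend on the encoding of the source file.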

A character encoding form for Unicode characters. Each 21-bit Unicode code point is represented using one to four 8-bit code units.
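The one-to-four code unit layout can be sketched as a small C++ encoder. The function name encode_utf8 is hypothetical, and the sketch assumes its argument is a valid Unicode scalar value (it asserts rather than reports errors).

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Encode one code point as a sequence of 8-bit UTF-8 code units.
inline std::string encode_utf8(std::uint32_t cp) {
    assert(cp <= 0x10FFFF);              // within the Unicode code space
    assert(cp < 0xD800 || cp > 0xDFFF);  // surrogates are not encodable
    std::string out;
    if (cp < 0x80) {                     // 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {             // 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {           // 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                             // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For example, U+20AC encodes as the three bytes 0xE2 0x82 0xAC, and U+1F600 as the four bytes 0xF0 0x9F 0x98 0x80.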

A character encoding form for Unicode characters. Each 21-bit Unicode code point is represented using one or two 16-bit code units. UTF-16BE and UTF-16LE are particular character encoding schemes for UTF-16.

A character encoding scheme for UTF-16 that serializes code units in big-endian format.

A character encoding scheme for UTF-16 that serializes code units in little-endian format.
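The difference between the two UTF-16 serializations can be sketched in C++ as below. The function names are hypothetical, and the sketch writes no byte order mark.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// UTF-16BE: most significant byte of each code unit first.
inline std::vector<std::uint8_t> serialize_utf16be(const std::u16string& units) {
    std::vector<std::uint8_t> bytes;
    for (char16_t u : units) {
        bytes.push_back(static_cast<std::uint8_t>(u >> 8));
        bytes.push_back(static_cast<std::uint8_t>(u & 0xFF));
    }
    assert(bytes.size() == 2 * units.size());
    return bytes;
}

// UTF-16LE: least significant byte of each code unit first.
inline std::vector<std::uint8_t> serialize_utf16le(const std::u16string& units) {
    std::vector<std::uint8_t> bytes;
    for (char16_t u : units) {
        bytes.push_back(static_cast<std::uint8_t>(u & 0xFF));
        bytes.push_back(static_cast<std::uint8_t>(u >> 8));
    }
    assert(bytes.size() == 2 * units.size());
    return bytes;
}
```

The code unit 0x20AC serializes as the bytes 0x20 0xAC in UTF-16BE and as 0xAC 0x20 in UTF-16LE; the code units themselves are identical in both schemes.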

A character encoding form for Unicode characters. Each 21-bit Unicode code point is represented using a single 32-bit code unit. UTF-32BE and UTF-32LE are particular character encoding schemes for UTF-32.

A character encoding scheme for UTF-32 that serializes code units in big-endian format.

A character encoding scheme for UTF-32 that serializes code units in little-endian format.

A character encoding form whose sequences are not all of the same length. If a double-byte character set uses one or two code units to represent a code point, then it is a variable width encoding. Multibyte character sets (MBCS) are variable width.
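One practical consequence of variable width is that the number of code units in a string is not the number of code points. A minimal C++ sketch, using UTF-8 as the variable width encoding and a hypothetical function name, assuming well-formed input:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Count code points in well-formed UTF-8 by skipping continuation bytes,
// which always have the bit pattern 10xxxxxx.
inline std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++n;  // lead byte or single-byte character
        }
    }
    assert(n <= utf8.size());  // never more code points than code units
    return n;
}
```

A string holding Q, é, and € in UTF-8 occupies six code units (1 + 2 + 3) but contains three code points. With a fixed width encoding such as UTF-32 the two counts are always equal.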