Internationalization Module User’s Guide : Appendix A Glossary
Appendix A Glossary
abstract character
An atomic unit of textual communication; for example, the letter Q.
abstract character repertoire
See character set.
ANSI
An acronym for the American National Standards Institute. When referring to character sets, ANSI is often used as the collective name for all Microsoft Windows code pages, and sometimes is used to specify code page 1252, which is a superset of ISO/IEC 8859-1.
ASCII
An acronym for American Standard Code for Information Interchange, a 7-bit code that is the U.S. national variant of ISO/IEC 646. Formally, the U.S. standard ANSI X3.4.
base character
A character that does not graphically combine with preceding characters and that is neither a control nor a format character.
basic source character set
The set of characters used to compose a C++ source program, as defined by the C++ Standard.
canonical equivalence
The fundamental equivalence between individual Unicode characters and sequences of Unicode characters. Appropriately rendered, canonical equivalents are indistinguishable.
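As an illustration only, the following minimal sketch uses Python's standard unicodedata module (not the Internationalization Module's API) to show that a precomposed character and its decomposed sequence differ as code point sequences yet compare equal once both are normalized. The helper name canonically_equivalent is hypothetical.

```python
import unicodedata

def canonically_equivalent(a: str, b: str) -> bool:
    """Return True if a and b are equal after canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT

print(precomposed == decomposed)                        # False: different code points
print(canonically_equivalent(precomposed, decomposed))  # True: same canonical form
```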
character
See abstract character.
character encoding
A synonym for coded character set.
character encoding form
A mapping from a coded character set to the actual code units used to represent the data.
character encoding scheme
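A character encoding form plus byte serialization. The Unicode Standard defines seven character encoding schemes: UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, and UTF-32LE. The unmarked UTF-16 and UTF-32 schemes may use a byte order mark to signal the byte order.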
character equivalence
Different Unicode code points or sequences of code points considered equivalent forms of the same information. See also equivalence.
character map
A complete mapping from abstract characters to code points to code units to bytes. A character map thus implicitly includes a coded character set, a character encoding form, and a character encoding scheme. The Internet Assigned Numbers Authority (IANA) charset identifiers refer to character maps.
character properties
A set of property names and property values associated with individual characters in the Unicode Character Database.
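A few of these properties can be inspected with Python's standard unicodedata module, which exposes data derived from the Unicode Character Database. This is a sketch for illustration, not the module's own interface; the helper name describe is hypothetical.

```python
import unicodedata

def describe(ch: str) -> tuple:
    """Return (name, general category, canonical combining class) for ch."""
    return (unicodedata.name(ch), unicodedata.category(ch), unicodedata.combining(ch))

for ch in ("Q", "\u0301", "\u00bd"):
    name, category, combining_class = describe(ch)
    print(f"U+{ord(ch):04X}  {name}  {category}  {combining_class}")
```

Here Q is an uppercase letter (category Lu), U+0301 is a nonspacing mark (Mn) with a nonzero combining class, and U+00BD is a number of category No.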
character set
An unordered collection of abstract characters used to represent textual information. Also called an abstract character repertoire.
coded character repertoire
A synonym for coded character set.
coded character scheme
A mapping of code units into byte sequences. Whereas a character encoding form maps code points to code units, a character encoding scheme maps code units to bytes.
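The two mapping steps can be sketched in Python, used here purely for illustration. The helper code_units is hypothetical: it recovers code units by serializing in big-endian order and regrouping the bytes, which is one convenient way to show them, not how any particular library works internally.

```python
def code_units(text: str, form: str) -> list:
    """Split text into the code units of the encoding form 'utf-8', 'utf-16' or 'utf-32'."""
    width = {"utf-8": 1, "utf-16": 2, "utf-32": 4}[form]
    # Serialize big-endian without a byte order mark, then regroup bytes into units.
    raw = text.encode(form if width == 1 else form + "-be")
    return [int.from_bytes(raw[i:i + width], "big") for i in range(0, len(raw), width)]

ch = chr(0x20AC)  # code point U+20AC, EURO SIGN

# Encoding form: code point -> code units
utf8_units = code_units(ch, "utf-8")    # three 8-bit code units
utf16_units = code_units(ch, "utf-16")  # one 16-bit code unit

# Encoding scheme: code units -> bytes, in a fixed byte order
utf16be = ch.encode("utf-16-be")
utf16le = ch.encode("utf-16-le")

print([hex(u) for u in utf8_units], [hex(u) for u in utf16_units])
print(utf16be.hex(), utf16le.hex())
```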
coded character set
A mapping from a set of abstract characters to a set of non-negative integers.
code page
A coded character set. Usually refers to the coded character set used by a personal computer; for example, PC code page 437, the default coded character set used by the U.S. English version of the DOS operating system.
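Because each code page is its own coded character set, the same byte value can stand for different characters. A small sketch using Python's built-in codecs for code pages 437 and 1252 (for illustration only):

```python
byte = b"\xe9"

as_cp1252 = byte.decode("cp1252")  # LATIN SMALL LETTER E WITH ACUTE
as_cp437 = byte.decode("cp437")    # GREEK CAPITAL LETTER THETA

# The same abstract character lives at a different position in each code page.
e_acute_in_cp437 = "\u00e9".encode("cp437")

print(f"U+{ord(as_cp1252):04X}", f"U+{ord(as_cp437):04X}", e_acute_in_cp437.hex())
```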
code point
The integer associated with an abstract character in a coded character set.
code unit
The fundamental binary width in a computer architecture used for representing character data, such as 7 bits, 8 bits, 16 bits, or 32 bits. Depending on the character encoding form used, each code point in a coded character set may be represented internally by one or more code units.
collation
The process of sorting and comparing text strings.
combining character
A character that graphically combines with a preceding base character; for example, a diacritic mark.
compatibility character
A Unicode character provided for round-trip compatibility with some preexisting character encoding standard.
compatibility equivalence
The equivalence between compatibility characters and existing nominal characters. For example, the compatibility character ½ (U+00BD) corresponds to the nominal sequence 1/2 (U+0031, U+2044, and U+0032).
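The example above can be checked with a short sketch using Python's standard unicodedata module (illustration only). Canonical decomposition leaves the compatibility character untouched, while compatibility decomposition produces the nominal sequence.

```python
import unicodedata

half = "\u00bd"  # VULGAR FRACTION ONE HALF

nfd = unicodedata.normalize("NFD", half)    # canonical decomposition: unchanged
nfkd = unicodedata.normalize("NFKD", half)  # compatibility decomposition

print([f"U+{ord(c):04X}" for c in nfd])
print([f"U+{ord(c):04X}" for c in nfkd])
```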
composite character
See decomposable character.
conversion
The process of mapping characters from one character encoding to another. Note that conversion does not change the characters themselves; it merely changes the numbers used to represent those characters within the computer.
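A minimal sketch of this idea, using Python's built-in codecs rather than the module's converters. The function name convert is hypothetical. The bytes change, but decoding either form yields the same characters.

```python
def convert(data: bytes, source: str, target: str) -> bytes:
    """Re-encode data from the source encoding to the target encoding."""
    return data.decode(source).encode(target)

cp1252_bytes = "caf\u00e9".encode("cp1252")
utf8_bytes = convert(cp1252_bytes, "cp1252", "utf-8")

print(cp1252_bytes.hex(), utf8_bytes.hex())
# Same characters, different numeric representation.
print(cp1252_bytes.decode("cp1252") == utf8_bytes.decode("utf-8"))
```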
conversion context
A context that specifies the converter to use for implicit conversions.
converter
An object that converts text from one encoding to another. See conversion.
DBCS
Abbreviation for double-byte character set.
decomposable character
A character that is equivalent to a base character followed by one or more combining characters. Also known as a composite character.
decomposition
The process of separating or analyzing a text element into component units.
diacritic
A mark applied or attached to a symbol, such as an accent mark.
double-byte character set
A character set encoded with two bytes per character. This term is generally used in contrast with SBCS and/or MBCS. Abbreviated as DBCS.
EBCDIC
Abbreviation for Extended Binary-Coded Decimal Interchange Code. A family of 8-bit coded character sets used on mainframes.
encoded character
The association of an abstract character with a numeric code point in a character encoding.
equivalence
In the context of text processing, the process or result of establishing whether two text elements are identical in some respect.
execution character set
The character set used when a C++ application is executing. May be distinct from the basic source character set. Also called machine character set.
explicit conversion
A conversion performed by an explicitly specified converter. See also implicit conversion.
fixed width encoding
A character encoding form whose code unit sequences are all of the same length. For example, single-byte character sets (SBCS) are fixed width. If a double-byte character set (DBCS) always uses two code units to represent a code point, then it is also fixed width.
glyph
A particular image that represents an abstract character or part of a character. Many glyphs may be used to denote the same character; for example, the letter Q rendered in several different typefaces. Although recognizably different, these glyphs all represent the same abstract character Q.
IANA
An acronym for Internet Assigned Numbers Authority.
ICU
An acronym for International Components for Unicode. ICU is a set of open source libraries written in C and C++, originally developed by IBM and now maintained as a project of the Unicode Consortium.
implicit conversion
A conversion performed by the converter specified in the current conversion context.
informative property
A character property that is not a normative property, but that contributes to the correct use and implementation of the Unicode Standard. Informative properties may differ between conforming implementations of the Unicode Standard.
internationalization
The process of creating an application that can support a variety of languages and related cultural conventions.
locale
A set of conventions determined by human language and customs, as defined within a particular user community. These conventions include a particular written language, sorting orders, and formats for dates and numbers.
localization
The process of configuring an application to support a particular language and related cultural conventions.
machine character set
See execution character set.
MBCS
Abbreviation for multibyte character set.
multibyte character set
A character set encoded with a variable number of bytes per character. Abbreviated as MBCS. See also DBCS and SBCS.
normalization
The process of converting Unicode text to a particular unique representation.
normalization form
A scheme for uniquely representing text in Unicode, as defined by the Unicode Standard Annex #15, Unicode Normalization Forms: http://www.unicode.org/unicode/reports/tr15/
normative property
A property that is required for conformance with the Unicode Standard. Normative properties may not differ between conforming implementations of the Unicode Standard. See also informative property.
regular expression
A sequence of characters that represents a pattern.
resource bundle
A collection of resources associated with a given locale.
SBCS
An abbreviation for single-byte character set.
single-byte character set
A character set encoded with one byte per character. This term is generally used in contrast with DBCS or MBCS. Abbreviated as SBCS.
surrogate pair
A sequence of two 16-bit code units used in UTF-16 to represent a single Unicode code point in the range 0x10000 to 0x10FFFF.
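The arithmetic behind a surrogate pair can be sketched as follows, in Python for illustration. The function name to_surrogate_pair is hypothetical. The code point is offset by 0x10000, and the resulting 20-bit value is split into two 10-bit halves.

```python
def to_surrogate_pair(code_point: int) -> tuple:
    """Return the (high, low) UTF-16 code units for a supplementary code point."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is outside the range 0x10000 to 0x10FFFF")
    offset = code_point - 0x10000     # 20-bit value
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return high, low

high, low = to_surrogate_pair(0x1F600)
print(hex(high), hex(low))  # 0xd83d 0xde00
```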
UCS-2
The Universal Character Set coded in 2 octets, as specified by the ISO/IEC 10646 standard. UCS-2 is a two-byte fixed width subset of UTF-16 that cannot represent code points above 0xFFFF. See Appendix C, “Relationship to ISO/IEC 10646,” in the Unicode Standard.
UCS-4
The Universal Character Set coded in 4 octets, as specified by the ISO/IEC 10646 standard. UCS-4 is equivalent to UTF-32. See Appendix C, “Relationship to ISO/IEC 10646,” in the Unicode Standard.
Unicode
A standard for representing, on a computer, virtually all of the characters used in most of the world's scripts.
Unicode Character Database
A collection of files providing Unicode character properties and mappings: http://www.unicode.org/ucd/
universal character name
An escape sequence of the form \uXXXX or \UXXXXXXXX, where each X is a hexadecimal digit and the digits together specify a code point in the ISO/IEC 10646 and Unicode coded character sets.
UTF-8
A character encoding form for Unicode characters. Each 21-bit Unicode code point is represented using one to four 8-bit code units.
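The number of code units depends on the magnitude of the code point. A minimal sketch in Python (for illustration; the function name utf8_length is hypothetical) that applies the standard UTF-8 range boundaries and compares the result with Python's own encoder:

```python
def utf8_length(code_point: int) -> int:
    """Return how many 8-bit code units UTF-8 uses for a code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if code_point < 0x80:
        return 1
    if code_point < 0x800:
        return 2
    if code_point < 0x10000:
        return 3
    return 4

# Q, e with acute, euro sign, and a supplementary-plane character
samples = [0x51, 0xE9, 0x20AC, 0x1F600]
lengths = [utf8_length(cp) for cp in samples]
print(lengths)  # [1, 2, 3, 4]
```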
UTF-16
A character encoding form for Unicode characters. Each 21-bit Unicode code point is represented using one or two 16-bit code units. UTF-16BE and UTF-16LE are particular character encoding schemes for UTF-16.
UTF-16BE
A character encoding scheme for UTF-16 that serializes code units in big-endian format.
UTF-16LE
A character encoding scheme for UTF-16 that serializes code units in little-endian format.
UTF-32
A character encoding form for Unicode characters. Each 21-bit Unicode code point is represented using a single 32-bit code unit. UTF-32BE and UTF-32LE are particular character encoding schemes for UTF-32.
UTF-32BE
A character encoding scheme for UTF-32 that serializes code units in big-endian format.
UTF-32LE
A character encoding scheme for UTF-32 that serializes code units in little-endian format.
variable width encoding
A character encoding form whose code unit sequences are not all of the same length. If a double-byte character set uses one or two code units to represent a code point, then it is a variable width encoding. Multibyte character sets (MBCS) are variable width.