Representing Text in Computers

Internationalization Module User's Guide

2.3 Representing Text in Computers

Fundamentally, computers just deal with numbers. To store letters and other characters, computers must assign a number to each one. There are hundreds of different encoding systems currently in use around the world for assigning these numbers.

The sections that follow provide a conceptual framework for understanding how these encodings work.

2.3.1 Abstract Characters

A character is an abstract, atomic unit of communication. For example, the letter Q is a character in English.

Characters and glyphs are distinct concepts. A glyph is a particular image that represents a character or part of a character. Many glyphs may be used to denote the same character; for example, the letter Q set in several different typefaces and styles. Although recognizably different, these glyphs all represent the same abstract character Q. Similarly, some characters are represented by different glyphs depending on context, such as whether they occur in isolated, initial, medial, or final position, as the letters of the Arabic script do.

2.3.2 Character Sets

A character set (or abstract character repertoire) is an unordered set of characters. The set is defined by convention, such as a writing system, or by the publication of a standard. Examples of character sets include the Western European alphabets and symbols of Latin-1, the POSIX portable character repertoire, the Windows Western European repertoire, the Japanese syllabaries and ideographs of JIS X 0208, and so forth.

2.3.3 Coded Character Sets

A coded character set (also called a character encoding, coded character repertoire, or code page) is a mapping from a set of abstract characters to a set of non-negative integers. The result is a set of encoded characters that can be represented numerically within the computer. The range of integers need not be contiguous. Examples of coded character sets include ISO/IEC 8859-1 (Latin-1), Windows Code Page 1252 (whose repertoire is a superset of the 8859-1 graphic character repertoire), the Unicode Standard, and so forth.

The integer associated with an abstract character in a coded character set is called the code point for the character.
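As a minimal illustration of code points, the following Python sketch (Python and its standard codecs are an assumption made for this example; they are not part of the Internationalization Module) shows the integer that several coded character sets assign to a character. The helper name in_repertoire is hypothetical.

```python
# Python strings are sequences of Unicode code points; ord() exposes them.
q_code_point = ord("Q")            # LATIN CAPITAL LETTER Q, U+0051
euro_code_point = ord("\u20ac")    # EURO SIGN, U+20AC

# For a single-byte coded character set, the encoded byte value is the
# code point of the character within that set.
q_in_latin1 = "Q".encode("latin-1")[0]
q_in_cp1252 = "Q".encode("cp1252")[0]
euro_in_cp1252 = "\u20ac".encode("cp1252")[0]


def in_repertoire(ch, codec):
    """Return True if the named codec can represent the character ch."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


print(hex(q_code_point), hex(q_in_latin1), hex(q_in_cp1252))
print(hex(euro_code_point), hex(euro_in_cp1252))
print(in_repertoire("\u20ac", "latin-1"), in_repertoire("\u20ac", "cp1252"))
```

The same abstract character can have different code points in different coded character sets: the euro sign is 0x20AC in Unicode and 0x80 in Code Page 1252, and it has no code point at all in Latin-1.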

2.3.4 Character Encoding Forms

In order to represent characters in a computer, each code point in a coded character set must be mapped to a sequence of bits. This mapping is called a character encoding form.

A code unit is the fundamental binary width used in a computer architecture for representing character data, such as 7 bits, 8 bits, 16 bits, or 32 bits. Depending on the character encoding form used, each code point in a coded character set may be represented internally by one or more such code units.

A character encoding form whose code unit sequences are all of the same length is known as a fixed width encoding. For example, single-byte character sets (SBCS) are fixed width. If a double-byte character set (DBCS) always uses two code units to represent a code point, then it is also fixed width.

A character encoding form whose sequences are not all of the same length is known as a variable width encoding. If a double-byte character set uses one or two code units to represent a code point, then it is a variable width encoding. Multibyte character sets (MBCS) are variable width.
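The difference between fixed width and variable width can be observed by counting the code units produced for individual characters. The sketch below uses Python's standard latin-1 and shift_jis codecs as stand-ins for a single-byte and a double-byte character set; the choice of codecs and the helper name widths are assumptions made for illustration.

```python
# Sample characters: A, e with acute accent, HIRAGANA LETTER A.
samples = ["A", "\u00e9", "\u3042"]


def widths(chars, codec):
    """Map each character the codec can encode to its length in bytes."""
    result = {}
    for ch in chars:
        try:
            result[ch] = len(ch.encode(codec))
        except UnicodeEncodeError:
            pass  # the character is outside this codec's repertoire
    return result


latin1_widths = widths(samples, "latin-1")     # single-byte character set
sjis_widths = widths(samples, "shift_jis")     # double-byte character set

print(latin1_widths)
print(sjis_widths)
```

Every Latin-1 character occupies exactly one 8-bit code unit, so the encoding is fixed width. Under Shift-JIS the letter A occupies one code unit and the hiragana letter occupies two, so the encoding is variable width.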

Examples of character encoding forms include:

US ASCII, a 7-bit fixed width encoding form
ISO 8859-1, an 8-bit fixed width encoding form
CP 037 and CP 500, 8-bit fixed width EBCDIC encoding forms
Windows CP 1252, an 8-bit fixed width encoding form
Shift-JIS, an 8-bit variable width encoding form for JIS X 0208 that uses one or two code units per character
UTF-8, a variable width 8-bit encoding form for Unicode 3.0
UTF-16, a variable width 16-bit encoding form for Unicode 3.0
UTF-32, a fixed-width 32-bit encoding form for Unicode 3.0
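For the three Unicode encoding forms in the list above, the number of code units per code point can be counted directly. This is a sketch using Python's standard codecs, which is an assumption of the example; the function name code_units is hypothetical.

```python
# Size in bytes of one code unit for each Unicode encoding form.
UNIT_BYTES = {"utf-8": 1, "utf-16": 2, "utf-32": 4}


def code_units(ch, form):
    """Return how many code units the form needs for the character ch."""
    # Use the big-endian codec for the wider forms so that no byte order
    # mark is prepended to the output and counted by mistake.
    codec = form if form == "utf-8" else form + "-be"
    return len(ch.encode(codec)) // UNIT_BYTES[form]


# Q, e with acute accent, EURO SIGN, and a character beyond U+FFFF.
for ch in ["Q", "\u00e9", "\u20ac", "\U0001d11e"]:
    counts = [code_units(ch, form) for form in ("utf-8", "utf-16", "utf-32")]
    print(f"U+{ord(ch):04X}", counts)
```

UTF-8 uses between one and four 8-bit code units, UTF-16 uses one or two 16-bit code units, and UTF-32 always uses a single 32-bit code unit.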

2.3.5 Character Encoding Schemes

A character encoding scheme is a character encoding form plus byte serialization: a mapping of code units into serialized byte sequences. Whereas a character encoding form maps code points to code units, a character encoding scheme maps code units to bytes.

Character encoding schemes are required for cross-platform persistence involving code units wider than a byte. Most fixed width byte-oriented encoding forms have a trivial mapping from code units to bytes. Most variable width byte-oriented encoding forms simply serialize the sequence of code units. Encoding forms with 16-bit or 32-bit code units require schemes that specify the byte order.

For example, the UTF-16 character encoding form for Unicode 3.0 has two character encoding schemes, UTF-16BE and UTF-16LE, which specify whether the two bytes used to represent UTF-16 code units are serialized in big-endian or little-endian format, respectively.
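The effect of the two schemes can be seen by serializing the same UTF-16 code units both ways. The sketch below assumes Python's standard utf-16-be and utf-16-le codecs; the helper name hex_bytes is hypothetical.

```python
text = "Q\u20ac"   # two code points: U+0051 and U+20AC


def hex_bytes(data):
    """Format a byte sequence as space-separated hexadecimal pairs."""
    return " ".join(f"{b:02x}" for b in data)


big_endian = text.encode("utf-16-be")
little_endian = text.encode("utf-16-le")

print(hex_bytes(big_endian))      # 00 51 20 ac
print(hex_bytes(little_endian))   # 51 00 ac 20

# Reading bytes with the wrong scheme yields different code units,
# and therefore different characters.
misread = big_endian.decode("utf-16-le")
print([f"U+{ord(c):04X}" for c in misread])
```

The code units are identical in both cases (0x0051 and 0x20AC); only the order of the two bytes within each code unit differs. A reader that assumes the wrong scheme recovers the wrong characters, which is why the byte order must be specified.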

2.3.6 Character Map

The complete mapping from abstract characters to code points to code units to bytes (see Section 2.3.2 through Section 2.3.5) is called a character map. A character map thus implicitly includes a coded character set, a character encoding form, and a character encoding scheme.

The charset identifiers recognized by the Internet Assigned Numbers Authority (IANA) refer to character maps.
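The full chain from abstract character to bytes can be traced for a few charset names. The sketch below relies on Python's codec registry, which accepts many IANA charset names as aliases; that registry is an assumption of this example and is unrelated to how the Internationalization Module resolves names.

```python
ch = "\u00e9"   # abstract character: LATIN SMALL LETTER E WITH ACUTE

# Each charset name selects a complete character map: a coded character
# set, an encoding form, and an encoding scheme.
encoded = {}
for charset in ("ISO-8859-1", "UTF-8", "UTF-16BE"):
    encoded[charset] = ch.encode(charset)

print(f"code point: U+{ord(ch):04X}")
for charset, data in encoded.items():
    print(charset, data.hex())
```

The same abstract character becomes one byte under ISO-8859-1, two bytes under UTF-8, and two differently valued bytes under UTF-16BE, so a byte sequence can be interpreted only when its character map is known.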
