The Unicode Standard
The Unicode Standard provides a universal character set for written characters and text. It defines a consistent way of encoding multilingual text and a common scheme for the exchange and manipulation of such text. The Unicode Standard defines:
a coded character set
a set of character encoding forms for that coded character set
a set of character encoding schemes for those encoding forms
The sections that follow examine each of these components individually. For more information on these concepts generally, see “Representing Text in Computers”.
Unicode Coded Character Set
Unicode is a coded character set. It assigns numeric values from 0 to 0x10FFFF to abstract characters.
The Unicode Standard provides the capacity to encode nearly every character used in all of the writing systems of the world. It provides a unique integer to represent every character, no matter what the platform, no matter what the program, no matter what the language.
No escape sequences or control codes are required to specify any characters. The Unicode character encoding treats alphabetic characters, ideographic characters, and symbols equivalently. Characters from different scripts may be mixed and processed together as required.
In text, Unicode code points are usually expressed as U+n, where n is four to six hexadecimal digits, using the digits 0-9 and A-F (for the values 10 through 15). Code points with fewer than four significant hexadecimal digits are padded with leading zeros to four digits; otherwise, leading zeros are not used. For example, U+00E9 represents the Unicode code point for é. This is the convention followed in the documentation for the Internationalization Module, including this manual.
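As an illustration of this convention, the following sketch in standard C++ (it does not use the Internationalization Module API) prints a code point in the U+n form, padding with leading zeros to four hexadecimal digits where necessary:

    #include <cstdio>

    // Prints a code point using the U+n convention: at least four
    // hexadecimal digits, padded with leading zeros when necessary.
    void printCodePoint(unsigned long codePoint)
    {
        std::printf("U+%04lX\n", codePoint);
    }

    int main()
    {
        printCodePoint(0xE9);     // prints U+00E9, the code point for é
        printCodePoint(0x10400);  // prints U+10400, a supplementary code point
        return 0;
    }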
Unicode Character Encoding Forms
Any character in the Unicode character set can be expressed using 21 bits. The Unicode Standard defines three character encoding forms for representing each 21-bit code point in memory:
UTF-8
Each 21-bit code point is represented using one to four 8-bit code units.
UTF-16
Each 21-bit code point is represented using one or two 16-bit code units.
UTF-32
Each 21-bit code point is represented using a single 32-bit code unit.
The UTF-16 encoding form strikes a balance between ease of use and efficient use of memory. Most characters can be represented with a single 16-bit code unit. Only characters in the range 0x10000 to 0x10FFFF must be represented with a surrogate pair of two UTF-16 code units.
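The following sketch in standard C++ (again, not the Internationalization Module API) shows how a single code point is represented in each encoding form, using the conversion rules given in the Unicode Standard. It illustrates why U+00E9 occupies a single UTF-16 code unit while a supplementary code point such as U+10400 requires a surrogate pair:

    #include <cstdint>
    #include <vector>

    // UTF-32: every code point is a single 32-bit code unit.
    std::vector<std::uint32_t> toUtf32(std::uint32_t cp)
    {
        return { cp };
    }

    // UTF-16: code points below U+10000 are one 16-bit code unit; code points
    // in the range U+10000 to U+10FFFF become a surrogate pair of two units.
    std::vector<std::uint16_t> toUtf16(std::uint32_t cp)
    {
        if (cp < 0x10000)
            return { static_cast<std::uint16_t>(cp) };
        cp -= 0x10000;                                         // 20 bits remain
        std::uint16_t high = 0xD800 + (cp >> 10);              // leading surrogate
        std::uint16_t low  = 0xDC00 + (cp & 0x3FF);            // trailing surrogate
        return { high, low };
    }

    // UTF-8: one to four 8-bit code units, depending on the code point value.
    std::vector<std::uint8_t> toUtf8(std::uint32_t cp)
    {
        if (cp < 0x80)
            return { static_cast<std::uint8_t>(cp) };
        if (cp < 0x800)
            return { static_cast<std::uint8_t>(0xC0 | (cp >> 6)),
                     static_cast<std::uint8_t>(0x80 | (cp & 0x3F)) };
        if (cp < 0x10000)
            return { static_cast<std::uint8_t>(0xE0 | (cp >> 12)),
                     static_cast<std::uint8_t>(0x80 | ((cp >> 6) & 0x3F)),
                     static_cast<std::uint8_t>(0x80 | (cp & 0x3F)) };
        return { static_cast<std::uint8_t>(0xF0 | (cp >> 18)),
                 static_cast<std::uint8_t>(0x80 | ((cp >> 12) & 0x3F)),
                 static_cast<std::uint8_t>(0x80 | ((cp >> 6) & 0x3F)),
                 static_cast<std::uint8_t>(0x80 | (cp & 0x3F)) };
    }

    int main()
    {
        // U+00E9 (é): two UTF-8 code units, one UTF-16 code unit, one UTF-32 code unit.
        toUtf8(0x00E9); toUtf16(0x00E9); toUtf32(0x00E9);
        // U+10400: four UTF-8 code units, two UTF-16 code units (a surrogate pair),
        // and one UTF-32 code unit.
        toUtf8(0x10400); toUtf16(0x10400); toUtf32(0x10400);
        return 0;
    }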
The Internationalization Module uses UTF-16 for the internal representation and manipulation of multilingual text.
Unicode Character Encoding Schemes
Unicode 3.0 defines five character encoding schemes: UTF-8, UTF-16BE, UTF-16LE, UTF-32BE, and UTF-32LE.
UTF-16BE and UTF-16LE differ only in whether the bytes of each code unit are serialized in big-endian or little-endian order; the same distinction applies to UTF-32BE and UTF-32LE.
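For illustration, the sketch below (standard C++, not the Internationalization Module API) serializes a sequence of UTF-16 code units under both byte orders, showing how the same code units produce different byte sequences in UTF-16BE and UTF-16LE:

    #include <cstdint>
    #include <vector>

    // UTF-16BE: the most significant byte of each code unit comes first.
    std::vector<std::uint8_t> serializeUtf16BE(const std::vector<std::uint16_t>& units)
    {
        std::vector<std::uint8_t> bytes;
        for (std::uint16_t u : units) {
            bytes.push_back(static_cast<std::uint8_t>(u >> 8));
            bytes.push_back(static_cast<std::uint8_t>(u & 0xFF));
        }
        return bytes;
    }

    // UTF-16LE: the least significant byte of each code unit comes first.
    std::vector<std::uint8_t> serializeUtf16LE(const std::vector<std::uint16_t>& units)
    {
        std::vector<std::uint8_t> bytes;
        for (std::uint16_t u : units) {
            bytes.push_back(static_cast<std::uint8_t>(u & 0xFF));
            bytes.push_back(static_cast<std::uint8_t>(u >> 8));
        }
        return bytes;
    }

    int main()
    {
        // The single code unit 0x00E9 (é) serializes as the bytes 00 E9 in
        // UTF-16BE and as E9 00 in UTF-16LE.
        serializeUtf16BE({ 0x00E9 });
        serializeUtf16LE({ 0x00E9 });
        return 0;
    }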
NOTE >> This document uses UTF-16 to mean either UTF-16BE or UTF-16LE, whichever is the natural format for the platform in use.