Determining your Character Encoding Needs
The XML Streams package provides a simple interface for inserting and extracting characters and strings in various encodings.
Depending on your character encoding requirements, you may or may not need to build and link the Internationalization Module. Because building and linking the Internationalization Module also links in the International Components for Unicode (ICU) libraries, it is wise to evaluate your encoding needs first.
Deciding if You Need the Internationalization Module
This section provides an overview to help you evaluate your character encoding requirements, and whether or not you need the Internationalization Module. “The XML Streams Package Character Encoding Requirements” includes more detailed information on the specific requirements of the XML Streams package.
When You Do Not Need the Internationalization Module
You do not need the Internationalization Module if you know you will be building XML streams in one of the following encodings:
UTF-8
US-ASCII
ISO-8859-1 (Latin-1, a superset of US-ASCII)
UTF-16BE or UTF-16LE
(“BE” and “LE” stand for “big endian” and “little endian” and refer to the order in which the two bytes of each 16-bit code unit are stored. Big endian places the most significant byte at the lowest address, storing the “big end first.” By contrast, little endian places the least significant byte at the lowest address, storing the “little end first.” Unless specified otherwise, UTF-16 defaults to BE.)
SourcePro’s regular string and stream processing classes can accommodate these five encodings without linking the Internationalization Module.
If you need conversion to and from other encodings, or more advanced manipulation of strings in UTF-16, you will want to use the Internationalization Module.
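The difference between UTF-16BE and UTF-16LE described above can be illustrated with a minimal sketch in standard C++. The helper name `serializeUtf16` is hypothetical and is not part of SourcePro; it only shows how the same 16-bit code units yield different byte sequences depending on byte order.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper for illustration only; not a SourcePro function.
// Writes each UTF-16 code unit as two bytes in the requested byte order.
std::vector<std::uint8_t> serializeUtf16(const std::u16string& text, bool bigEndian)
{
    std::vector<std::uint8_t> bytes;
    bytes.reserve(text.size() * 2);
    for (char16_t unit : text) {
        const std::uint8_t high = static_cast<std::uint8_t>(unit >> 8);
        const std::uint8_t low  = static_cast<std::uint8_t>(unit & 0xFF);
        if (bigEndian) {
            bytes.push_back(high);  // most significant byte at the lowest address
            bytes.push_back(low);
        } else {
            bytes.push_back(low);   // least significant byte at the lowest address
            bytes.push_back(high);
        }
    }
    return bytes;
}
```

For the euro sign (U+20AC), the big-endian form is the byte pair 0x20 0xAC and the little-endian form is 0xAC 0x20.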
When You Do Need the Internationalization Module
If you are building and serializing XML streams in encodings other than those listed in “When You Do Not Need the Internationalization Module,” you must build and link the Internationalization Module. The Internationalization Module can convert a byte stream between UTF-16 and any other supported encoding, and offers advanced string manipulation such as collation, Unicode regular expression searches, and resource bundles.
The XML Streams Package Character Encoding Requirements
The classes in the XML Streams package all read and write UTF-8 encoded documents. This means that the XML input streams take in UTF-8 only, and the XML output streams produce UTF-8 only.
You can, however, take advantage of various conversion utilities in SourcePro Core to convert your XML streams to and from any recognized character encoding. The Essential Tools Module and Internationalization Module contain classes that help you convert your data to UTF-8 before it is sent into an XML input stream, and from UTF-8 after it is returned by an XML output stream.
In addition, you may use a UTF-16 Unicode or wide character inserter or extractor interface to your XML streams, and the XML streams classes will internally convert between UTF-16 and UTF-8 as necessary. For a discussion on narrow character, wide, and Unicode interfaces, see “Narrow Character Interfaces” and “Wide and Unicode Character Interfaces.”
NOTE >> If your strings are in an encoding other than UTF-8 or UTF-16, you must convert them to one of these encodings before inserting them into an XML stream. For more information, see “Using the XML Streams Package with the Internationalization Module.”
Narrow Character Interfaces
All narrow character interfaces, such as RWCString, char, and char* inserters and extractors, take or produce only UTF-8 encoded characters. If you are using an XML output stream with a narrow character interface and you insert a character that is not UTF-8 encoded, the stream may produce an incorrect document. If your character encoding is UTF-16, you may use RWBasicUString from the Essential Tools Module to convert it to UTF-8. If your encoding is other than UTF-8 or UTF-16, you will need to use RWUString and the conversion utility classes from the Internationalization Module. See “Using the XML Streams Package with the Internationalization Module.”
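To see why a non-UTF-8 byte inserted through a narrow character interface can corrupt a document, consider ISO-8859-1 text. The sketch below uses only standard C++, and the helper name `latin1ToUtf8` is hypothetical; it is not a SourcePro class or function, and in practice you would use the conversion classes named above. It shows that every Latin-1 byte at or above 0x80 must become a two-byte UTF-8 sequence.

```cpp
#include <string>

// Hypothetical helper for illustration only; not a SourcePro function.
// Converts ISO-8859-1 (Latin-1) bytes to their UTF-8 encoding.
std::string latin1ToUtf8(const std::string& latin1)
{
    std::string utf8;
    utf8.reserve(latin1.size());
    for (char ch : latin1) {
        const unsigned char byte = static_cast<unsigned char>(ch);
        if (byte < 0x80) {
            // US-ASCII range: identical in Latin-1 and UTF-8.
            utf8.push_back(static_cast<char>(byte));
        } else {
            // U+0080..U+00FF: two-byte UTF-8 sequence 110xxxxx 10xxxxxx.
            utf8.push_back(static_cast<char>(0xC0 | (byte >> 6)));
            utf8.push_back(static_cast<char>(0x80 | (byte & 0x3F)));
        }
    }
    return utf8;
}
```

For example, the Latin-1 byte 0xE9 (é) becomes the UTF-8 sequence 0xC3 0xA9. Passed through unconverted, the lone byte 0xE9 is not a valid UTF-8 sequence.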
Wide and Unicode Character Interfaces
All wide and Unicode character interfaces, such as RWWString, RWBasicUString, RWUString, wchar_t, and wchar_t* inserters and extractors, take or produce only UTF-16 encoded characters. If you are using an XML output stream with a wide or Unicode character interface and you insert a character that is not UTF-16 encoded, the stream may produce an incorrect document.
Output Streams
XML output streams convert UTF-16 encoded characters to UTF-8 before passing them on to the underlying data stream, as illustrated in Figure 2.
If your strings are in another encoding, convert them to UTF-16 before inserting them into the XML output stream.
Figure 2 – Wide or Unicode Interfaces to Output Streams
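The XML output streams perform the UTF-16 to UTF-8 conversion internally, so you do not write it yourself. As a rough sketch of what such a conversion involves, the following standard C++ function encodes UTF-16 code units as UTF-8, including surrogate pairs. The name `utf16ToUtf8` and the choice to throw on unpaired surrogates are assumptions made for illustration; they do not describe the SourcePro implementation.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper for illustration only; not a SourcePro function.
// Encodes UTF-16 text as UTF-8, combining surrogate pairs into one code point.
std::string utf16ToUtf8(const std::u16string& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {
            // High surrogate: must be followed by a low surrogate.
            if (i + 1 >= in.size() || in[i + 1] < 0xDC00 || in[i + 1] > 0xDFFF) {
                throw std::invalid_argument("unpaired high surrogate");
            }
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;  // consume the low surrogate
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            throw std::invalid_argument("unpaired low surrogate");
        }

        if (cp < 0x80) {                       // 1 byte
            out.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {               // 2 bytes
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else if (cp < 0x10000) {             // 3 bytes
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {                               // 4 bytes
            out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}
```

A character in the Basic Multilingual Plane, such as U+20AC, produces three UTF-8 bytes (0xE2 0x82 0xAC). A surrogate pair always produces four.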
Input Streams
XML input streams convert from UTF-8 to UTF-16 before returning wide or Unicode characters or strings, as illustrated in Figure 3.
You may optionally convert your strings to another encoding after extracting them from the XML.
Figure 3 – Wide or Unicode Interfaces to Input Streams
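The reverse conversion, which the XML input streams perform internally, can be sketched in the same way. The standard C++ function below decodes UTF-8 into UTF-16 code units and rejects malformed input. The name `utf8ToUtf16` and the error handling are assumptions made for illustration and are not taken from SourcePro.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper for illustration only; not a SourcePro function.
// Decodes UTF-8 into UTF-16, emitting surrogate pairs above U+FFFF.
std::u16string utf8ToUtf16(const std::string& in)
{
    // Smallest code point that legitimately needs 1, 2, 3, or 4 bytes.
    static const std::uint32_t minForLength[] = {0x0, 0x80, 0x800, 0x10000};

    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::uint32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (i + extra >= in.size()) {
            throw std::invalid_argument("truncated UTF-8 sequence");
        }
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char cont = static_cast<unsigned char>(in[i + k]);
            if ((cont & 0xC0) != 0x80) {
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            }
            cp = (cp << 6) | (cont & 0x3F);
        }
        // Reject overlong forms, encoded surrogates, and out-of-range values.
        if (cp < minForLength[extra] || cp > 0x10FFFF ||
            (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw std::invalid_argument("invalid UTF-8 code point");
        }
        i += extra + 1;

        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

Validation matters on the input side because an overlong or truncated sequence has no UTF-16 equivalent. This sketch throws in that case; how the XML input streams report such errors is not covered here.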