Chapter 7 International Features of the XML Streams Module

The XML Streams Module works with other SourcePro modules to allow you to create applications for an international audience. As part of this support for internationalization, the XML Streams Module allows you to build and serialize XML streams from almost any character encoding.

A character encoding -- or more formally a “coded character set” -- is a character set and its numerical representation.

The character encoding of your XML streams determines whether you need to build and link the Internationalization Module of SourcePro Core.

For instance, if you know your XML stream will be in US-ASCII, you do not need the Internationalization Module. If, however, you may need to convert your streams to and from another character encoding, such as the Japanese Shift-JIS or the Arabic ISO 8859-6, you can use the Internationalization Module to perform the conversion.
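One rough way to check whether existing data stays within US-ASCII is to confirm that every byte is in the range 0x00 to 0x7F. The sketch below uses only the C++ standard library; the function name isUsAscii is hypothetical and is not part of any SourcePro module.

```cpp
#include <string>

// Hypothetical helper, not a SourcePro API.
// Returns true if every byte of 'text' is a US-ASCII value (0x00-0x7F).
bool isUsAscii(const std::string& text)
{
    for (unsigned char byte : text) {
        if (byte > 0x7F) {
            return false;   // byte outside the 7-bit US-ASCII range
        }
    }
    return true;            // also true for an empty string
}
```

If a check like this fails for representative input, the data uses some other encoding, and a conversion step is likely to be needed.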

This chapter discusses SourcePro’s capabilities for internationalizing your XML streams and serialization process, including:

• How to determine your character encoding needs

• How to use the Internationalization Module with the XML Streams package

For detailed information on internationalizing and localizing applications, see the Internationalization Module User’s Guide.

This chapter refers to the Unicode encoding forms UTF-8 and UTF-16, which SourcePro modules use internally to store and manipulate text. This section provides an overview of these terms.

The Unicode Standard can encode the characters of nearly all of the world’s written languages. It defines three main encoding forms: UTF-8, UTF-16, and UTF-32.

The XML Streams Module offers conversion of strings and XML streams to and from UTF-8 and UTF-16. The Internationalization Module offers conversions between UTF-16 and any other character encoding.

Each encoding form serves a different purpose, so you can choose the one that best fits the memory requirements of your application and your system.

• UTF-8 uses 8-bit code units to represent each 21-bit Unicode code point. Storing a character may take from one to four code units. This form is backward compatible with ASCII-based APIs and protocols, and is the likely choice when the required character set is US-ASCII or non-Asian.

• UTF-16 uses 16-bit code units to represent each 21-bit Unicode code point. Storing a character may take one or two code units. UTF-16 is the encoding form used by RWBasicUString from the Essential Tools Module and by RWUString from the Internationalization Module. It is usually a good choice for most Asian character sets.
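The storage trade-off between the two forms can be illustrated by counting code units per code point. The following is a minimal sketch in standard C++; the function names utf8CodeUnits and utf16CodeUnits are hypothetical and do not belong to any SourcePro module.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helpers, not SourcePro APIs.
// Surrogate values (0xD800-0xDFFF) are not characters and are not
// treated specially here.

// Number of 8-bit code units UTF-8 needs for one Unicode code point.
std::size_t utf8CodeUnits(std::uint32_t codePoint)
{
    assert(codePoint <= 0x10FFFF);       // highest Unicode code point
    if (codePoint < 0x80)    return 1;   // US-ASCII range
    if (codePoint < 0x800)   return 2;
    if (codePoint < 0x10000) return 3;   // rest of the Basic Multilingual Plane
    return 4;                            // supplementary planes
}

// Number of 16-bit code units UTF-16 needs for one Unicode code point.
std::size_t utf16CodeUnits(std::uint32_t codePoint)
{
    assert(codePoint <= 0x10FFFF);
    return codePoint < 0x10000 ? 1 : 2;  // 2 = surrogate pair
}
```

For example, the Latin letter A (U+0041) occupies one byte in UTF-8 and two bytes in UTF-16, while the Hiragana letter A (U+3042) occupies three bytes in UTF-8 (three 8-bit units) but only two bytes in UTF-16 (one 16-bit unit). This difference is why UTF-8 tends to be more compact for ASCII-heavy text and UTF-16 for many Asian scripts.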