Chapter 8 International Features of the Advanced Tools Module

The Advanced Tools Module works with other SourcePro modules to allow you to create applications for an international audience. As part of this support for internationalization, the Advanced Tools Module supports building and serializing streams in almost any character encoding.

A character encoding -- or more formally a “coded character set” -- is a character set and its numerical representation.

Your stream’s character encoding helps you determine whether you need to build and link SourcePro Core’s Internationalization Module.

For instance, if you know your stream will be in US-ASCII, you will not need the Internationalization Module. However, if you may need to convert your streams to and from another character encoding, such as the Japanese Shift-JIS or the Arabic ISO 8859-6, you can take advantage of the power of the Internationalization Module.

This chapter discusses SourcePro’s capabilities for internationalizing your streams and serialization process, including:

• How to use the Internationalization Module with the Streams package

• How to use the Internationalization Module with the Serialization package

For detailed information on internationalizing and localizing applications, see the Internationalization Module User’s Guide.

This chapter refers to the Unicode encoding forms UTF-8 and UTF-16, as they are used internally by the SourcePro modules to store and manipulate text. This section provides an overview of these terms.

The Unicode Standard is able to encode all characters used for nearly all written languages in the world. It defines three main encoding forms: UTF-8, UTF-16, and UTF-32.

The Advanced Tools Module offers conversion of strings and streams between UTF-8 and UTF-16. The Internationalization Module offers conversions between UTF-16 and other character encodings.

Each encoding form serves a different purpose, so you can select the one that best fits your application’s requirements and the system’s memory constraints.

• UTF-8 uses 8-bit code units to represent each 21-bit Unicode code point. Storing a character may take from one to four code units. This form offers backwards compatibility with ASCII-based APIs and protocols, and is the likely choice when the text is US-ASCII or is written mostly in non-Asian scripts.

• UTF-16 uses 16-bit code units to represent each 21-bit Unicode code point. Storing a character takes either one or two code units. It is the encoding form used by the RWUString class of the Internationalization Module. UTF-16 is usually a good choice for most Asian character sets.
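The difference in code-unit counts described above can be illustrated with a small standalone sketch. It uses only standard C++ and does not call any SourcePro API. The helper names (isScalarValue, utf8Length, utf16Length, encodeUtf16) are hypothetical and were chosen for this illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helpers for illustration; these are NOT part of SourcePro.

// True if cp is a Unicode scalar value: at most U+10FFFF and not a surrogate.
constexpr bool isScalarValue(char32_t cp) {
    return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}

// Number of 8-bit code units UTF-8 needs for the code point cp (1 to 4).
constexpr std::size_t utf8Length(char32_t cp) {
    return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
}

// Number of 16-bit code units UTF-16 needs for the code point cp (1 or 2).
constexpr std::size_t utf16Length(char32_t cp) {
    return cp < 0x10000 ? 1 : 2;
}

// Encode one scalar value as UTF-16 code units.
// Code points above U+FFFF become a high/low surrogate pair.
inline std::vector<char16_t> encodeUtf16(char32_t cp) {
    assert(isScalarValue(cp));
    if (cp < 0x10000) {
        return { static_cast<char16_t>(cp) };
    }
    const char32_t v = cp - 0x10000;  // 20 significant bits remain
    return { static_cast<char16_t>(0xD800 + (v >> 10)),     // high surrogate
             static_cast<char16_t>(0xDC00 + (v & 0x3FF)) }; // low surrogate
}
```

For example, the letter A (U+0041) takes one code unit in either form, so UTF-8 stores it in one byte and UTF-16 in two. The Hiragana letter U+3042 takes three 8-bit code units in UTF-8 but only one 16-bit code unit in UTF-16, which is why UTF-16 tends to be more compact for text in most Asian scripts.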