Internationalization Concepts

HydraExpress Web Service Development Guide

19.5 Internationalization Concepts

This section briefly introduces some concepts useful in working with XML documents in various character encodings.

19.5.1 What is a Character Encoding?

A character encoding -- or more formally a "coded character set" -- is a character set and its numerical representation.
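To make the definition concrete, the following minimal sketch shows one character, its number in the character set (its code point), and the bytes two different encodings use for that number. It uses only the Python standard library, not HydraExpress; the helper name describe is a hypothetical chosen for illustration.

```python
# A character set assigns each character a number (its code point);
# an encoding defines how that number is stored as bytes.
# "describe" is a hypothetical helper, not part of any HydraExpress API.
def describe(ch: str, encoding: str) -> tuple[int, bytes]:
    """Return the code point of ch and its byte representation in encoding."""
    return ord(ch), ch.encode(encoding)


code_point, latin1_bytes = describe("é", "latin-1")
_, utf8_bytes = describe("é", "utf-8")

print(hex(code_point))  # 0xe9
print(latin1_bytes)     # b'\xe9'
print(utf8_bytes)       # b'\xc3\xa9'
```

The character and its code point are the same in both cases; only the numerical representation on disk or on the wire differs.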

If your XML document uses a character encoding other than UTF-8, you can use the internationalization capabilities of HydraExpress to convert the document to and from that encoding, so that you can manipulate it in the encoding of your choice.
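The idea of converting to and from another encoding can be sketched as a round trip through Unicode text. This example uses the Python standard library codecs rather than the HydraExpress conversion classes, and the function name transcode is an assumption made for illustration.

```python
# Sketch of a round-trip conversion between Shift_JIS and UTF-8.
# "transcode" is a hypothetical helper, not a HydraExpress function.
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Decode data from the source encoding and re-encode it in the target."""
    return data.decode(source).encode(target)


sjis_bytes = "日本".encode("shift_jis")                     # document in its own encoding
utf8_bytes = transcode(sjis_bytes, "shift_jis", "utf-8")    # convert for processing
restored = transcode(utf8_bytes, "utf-8", "shift_jis")      # convert back

print(restored == sjis_bytes)  # True
```

Both directions pass through the same sequence of code points, which is why the round trip is lossless as long as every character exists in both encodings.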

19.5.2 An Introduction to Unicode

The related code examples on internationalization refer to the Unicode encoding forms UTF-8 and UTF-16, as they are used internally by HydraExpress to manipulate text and convert XML documents between UTF-8 and other encodings. This section provides an overview of these terms.

The Unicode Standard is able to encode all characters used for nearly all written languages in the world. It defines three main encoding forms: UTF-8, UTF-16, and UTF-32.

HydraExpress uses both UTF-8 and UTF-16 to perform character conversions. Each encoding form serves a different purpose, offering a programmer the opportunity to select the best development strategy, given the application's requirements and the system's memory requirements.

UTF-8 uses 8-bit code units to represent each 21-bit Unicode code point. Storing a character takes from one to four code units. This form offers backwards compatibility with ASCII-based APIs and protocols and is the likely choice when the required character set is US-ASCII or non-Asian.
UTF-16 uses 16-bit code units to represent each 21-bit Unicode code point. Storing a character takes either one or two code units. It is the encoding form used by the RWUString class of the Internationalization Module of SourcePro Core, and is usually a good choice for most Asian character sets.
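The code unit counts described above can be checked with a short sketch. It relies on the Python standard library codecs only; the function name code_units and the sample characters are chosen here for illustration.

```python
# Count how many code units each encoding form needs for one character.
# "code_units" is a hypothetical helper used only for this illustration.
def code_units(ch: str) -> tuple[int, int]:
    """Return (UTF-8 code units, UTF-16 code units) for a single character."""
    utf8_units = len(ch.encode("utf-8"))            # one byte per 8-bit unit
    utf16_units = len(ch.encode("utf-16-le")) // 2  # two bytes per 16-bit unit
    return utf8_units, utf16_units


for ch in ("A", "é", "日", "😀"):
    print(f"U+{ord(ch):04X}", code_units(ch))
# U+0041 (1, 1)
# U+00E9 (2, 1)
# U+65E5 (3, 1)
# U+1F600 (4, 2)
```

An ASCII letter costs one byte in UTF-8 but two in UTF-16, while a typical CJK character costs three bytes in UTF-8 but two in UTF-16, which is the trade-off behind the recommendations above.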

19.5.3 Character Encoding in an XML Prolog

An XML document always starts with a prolog. The prolog describes the contents of the document including its character encoding. The following prolog contains a mandatory version number and the optional encoding declaration.

<?xml version="1.0" encoding="Shift_JIS"?>

The entire contents of the XML document following the "EncodingDecl" section of the XML prolog must be in the specified character set. This includes everything in the message: URIs, end-of-line characters, whitespace, etc.

For example, in the XML fragment above, all characters following the "?>" must be in the Shift_JIS encoding. For more information on XML declarations, see the XML 1.0 specification at http://www.w3.org/TR/REC-xml#sec-prolog-dtd.
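As a rough illustration of how a consumer might honor the encoding declaration, the sketch below reads the declared name from the raw bytes and then decodes the whole document with it. It is a simplification under stated assumptions: the encoding is ASCII-compatible in the prolog (true for Shift_JIS, not for UTF-16), no byte order mark is present, and the name declared_encoding is hypothetical. It uses the Python standard library, not HydraExpress.

```python
import re

# Matches the encoding declaration of an XML prolog at the start of the data.
# The name pattern follows the EncName rule of XML 1.0:
# a letter followed by letters, digits, '.', '_' or '-'.
_ENCODING_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def declared_encoding(document: bytes, default: str = "utf-8") -> str:
    """Return the encoding named in the prolog, or the default if none is declared."""
    match = _ENCODING_DECL.match(document)
    return match.group(1).decode("ascii") if match else default


raw = '<?xml version="1.0" encoding="Shift_JIS"?><greeting>日本</greeting>'.encode(
    "shift_jis"
)
name = declared_encoding(raw)
text = raw.decode(name)  # everything after the prolog is decoded with the same name

print(name)  # Shift_JIS
```

A production parser would also need to handle byte order marks and encodings such as UTF-16, where the prolog itself is not readable as ASCII.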

© Copyright Rogue Wave Software, Inc. All Rights Reserved. Rogue Wave is a registered trademark of Rogue Wave Software, Inc. in the United States and other countries. HydraExpress is a trademark of Rogue Wave Software, Inc. All other trademarks are the property of their respective owners.