RWUFromUnicodeConverter

Internationalization Module Reference Guide

Module: Internationalization Module
Group: Character Encoding Scheme Conversion


RWUFromUnicodeConverter --> RWUConverterBase

Local Index

Members

convert()
ErrorResponseState()
ErrorResponseType

getSubstitutionSequence()
operator=()
reset()

restoreErrorResponseState()
RWUFromUnicodeConverter()
saveErrorResponseState()

setErrorResponse()
setSubstitutionSequence()

Header File

#include <rw/i18n/RWUFromUnicodeConverter.h>

Description

RWUFromUnicodeConverter provides a unidirectional text conversion facility for translating from UTF-16 to various byte-oriented standard character encoding schemes.

The convert() method appends the results of a conversion to a target buffer. If its flush argument is true, convert() flushes its internal buffers to the target buffer and clears its internal state. For modal encodings such as ISO-2022, clearing the internal state ensures that the next call to convert() produces target text that begins in the target encoding's default, unshifted state.

Calling convert() once with a value of true for flush is useful when converting a piece of text in its entirety from UTF-16 to a target encoding. In contrast, convert() may be used to fill a target buffer in a piecemeal fashion. Repeatedly calling convert() with a value of false for flush, then calling it once with a value of true, causes convert() to flush its buffers and clear its internal state only at the end of a multi-invocation conversion process.

At the conclusion of a successful call to convert() with flush set to true, the converter is reset automatically to a default, initial state, ready to start a new conversion process. Sometimes, however, it may be necessary to reset a converter explicitly using the reset() method:

if convert() has thrown an exception in response to an error, and you want to be sure the converter is in the default state before using it again
if you are using the converter to fill a target buffer in a piecemeal fashion, and you wish to abandon that conversion process to begin another
if you are copying a converter, and want to be sure the copy is in the default state
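
The interaction between flush, piecemeal conversion, and reset() can be illustrated with a small stand-alone sketch. ToyModalEncoder below is a hypothetical class invented for this illustration; it is not part of the Internationalization Module and does not implement ISO-2022 or any real encoding. It assumes a made-up modal scheme in which ASCII passes through unchanged and every other code unit is written as two bytes inside a region bracketed by SO (0x0E) and SI (0x0F).

```cpp
#include <string>

// Hypothetical toy encoder, used only to illustrate flush and reset
// semantics. It is not the RWUFromUnicodeConverter implementation.
class ToyModalEncoder {
public:
  // Appends the encoding of source to target. When flush is true, the
  // encoder returns to the unshifted state before the call completes.
  void convert(const std::u16string& source, std::string& target,
               bool flush = true)
  {
    for (char16_t u : source) {
      const bool ascii = (u < 0x80);
      // Leave the shifted region before writing ASCII.
      if (ascii && shifted_) { target += '\x0F'; shifted_ = false; }
      // Enter the shifted region before writing non-ASCII.
      if (!ascii && !shifted_) { target += '\x0E'; shifted_ = true; }
      if (ascii) {
        target += static_cast<char>(u);
      } else {
        target += static_cast<char>(u >> 8);
        target += static_cast<char>(u & 0xFF);
      }
    }
    // Flushing closes any open shifted region and clears the state.
    if (flush && shifted_) { target += '\x0F'; shifted_ = false; }
  }

  // Abandons a conversion in progress without writing to any target.
  void reset() { shifted_ = false; }

private:
  bool shifted_ = false; // true while inside an SO ... SI region
};
```

In this sketch, calling convert() twice with flush set to false and then true produces the same bytes as a single call with flush set to true. Calling reset() after an abandoned piecemeal conversion keeps a stale shift state from leaking into the next target buffer.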

Example

#include <rw/i18n/RWUToUnicodeConverter.h>
#include <rw/i18n/RWUFromUnicodeConverter.h>
#include <rw/i18n/RWUString.h>
#include <iostream>

using std::cout;
using std::endl;

int
main()
{
  // Convert from ISO-8859-1 to UTF-16.
  RWUToUnicodeConverter fromIso_8859_1("ISO-8859-1");
  RWCString cstr("She sat in the café, sipping coffee.");
  RWUString ustr;
  fromIso_8859_1.convert(cstr, ustr);

  // Convert from UTF-16 to US-ASCII.  Note that `?' is
  // substituted for `é', which cannot be represented
  // in US-ASCII.
  RWUFromUnicodeConverter toUsAscii("US-ASCII");
  toUsAscii.setSubstitutionSequence("?", 1);
  cout << ustr.toBytes(toUsAscii) << endl;

  // Save the error response state
  RWUFromUnicodeConverter::ErrorResponseState state =
    toUsAscii.saveErrorResponseState();

  // Convert from UTF-16 to US-ASCII again, replacing
  // `é' with an escape sequence suitable for use in
  // an XML or HTML file.
  toUsAscii.setErrorResponse(
   RWUFromUnicodeConverter::EscapeXmlHexadecimal);
  cout << ustr.toBytes(toUsAscii) << endl;

  // Restore the original error response state
  toUsAscii.restoreErrorResponseState(state);

  return 0;
} // main

Results:
========

She sat in the caf?, sipping coffee.
She sat in the caf&#xE9;, sipping coffee.

Public Enums

enum ErrorResponseType { Stop,
                         Skip,
                         Substitute, 
                         EscapeNativeHexadecimal, 
                         EscapeJavaHexadecimal, 
                         EscapeCHexadecimal,
                         EscapeXmlDecimal,
                         EscapeXmlHexadecimal
};

An ErrorResponseType value indicates what action an RWUFromUnicodeConverter should take when it encounters an error during the conversion process. Potential errors include code points with no mapping in the target encoding, and ill-formed code unit sequences, such as a high surrogate not followed by a low surrogate or a low surrogate without a preceding high surrogate. The default error response is RWUFromUnicodeConverter::Substitute. See setErrorResponse().
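
As an illustration of what an ill-formed code unit sequence looks like, the following stand-alone sketch scans a UTF-16 string for unpaired surrogates. The function name firstIllFormedUnit is hypothetical and is not part of the Internationalization Module; the sketch relies only on the UTF-16 rule that a high surrogate (0xD800 to 0xDBFF) must be immediately followed by a low surrogate (0xDC00 to 0xDFFF).

```cpp
#include <cstddef>
#include <string>

// Returns the index of the first ill-formed code unit in a UTF-16
// sequence, or std::u16string::npos if the sequence is well formed.
// Hypothetical helper for illustration only.
std::size_t firstIllFormedUnit(const std::u16string& s)
{
  for (std::size_t i = 0; i < s.size(); ++i) {
    const char16_t u = s[i];
    const bool isHigh = (u >= 0xD800 && u <= 0xDBFF);
    const bool isLow  = (u >= 0xDC00 && u <= 0xDFFF);
    if (isHigh) {
      // A high surrogate must be followed by a low surrogate.
      if (i + 1 >= s.size() || s[i + 1] < 0xDC00 || s[i + 1] > 0xDFFF)
        return i;
      ++i; // skip the low half of a valid pair
    } else if (isLow) {
      // A low surrogate with no preceding high surrogate.
      return i;
    }
  }
  return std::u16string::npos;
}
```

A converter would apply its current error response at the position this kind of check reports.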

The meanings of the ErrorResponseType values are as follows:

Stop

Stops the conversion process, and throws an RWUException.

Skip

Silently skips over any illegal sequences, without writing to the target buffer.

Substitute

Substitutes illegal sequences with the current substitution sequence. The default substitution sequence depends on the target encoding. For ASCII-based encodings, the default substitution sequence is 0x1A. See setSubstitutionSequence().

EscapeNativeHexadecimal

Replaces illegal sequences with a %UX escaped hexadecimal representation of the code units that comprise the illegal sequence--for example, %UFFFE%U00AC. Note that a code point represented by a surrogate pair is escaped as two hexadecimal values. If the target encoding does not support the characters {U,%}[A-F][0-9], an illegal sequence is replaced by the substitution sequence.

EscapeJavaHexadecimal

Replaces illegal sequences with a \uX escaped hexadecimal representation of the code units that comprise the illegal sequence--for example, \uFFFE\u00AC. Note that a code point represented by a surrogate pair is escaped as two hexadecimal values--for example, \uD84D\uDC56. If the target encoding does not support the characters {u,\}[A-F][0-9], an illegal sequence is replaced by the substitution sequence.

EscapeCHexadecimal

Replaces illegal sequences with a \uX escaped hexadecimal representation of the code units that comprise the illegal sequence--for example, \uFFFE\u00AC. Note that a code point represented by a surrogate pair is escaped as a single hexadecimal value--for example, \u00023456. If the target encoding does not support the characters {u,\}[A-F][0-9], an illegal sequence is replaced by the substitution sequence.

EscapeXmlDecimal

Replaces illegal sequences with a &#DDDD; escaped decimal representation of the code units that comprise the illegal sequence; for example, &#172;. Note that a code point represented by a surrogate pair is escaped as a single decimal value without zero padding; for example, &#144470;. If the target encoding does not support the characters {&,#,;}[0-9], an illegal sequence is replaced by the substitution sequence.

EscapeXmlHexadecimal

Replaces illegal sequences with a &#xXXXX; escaped hexadecimal representation of the code units that comprise the illegal sequence; for example, &#xAC;. Note that a code point represented by a surrogate pair is escaped as a single hexadecimal value without zero padding; for example, &#x12345;. If the target encoding does not support the characters {&,#,x,;}[0-9], an illegal sequence is replaced by the substitution sequence.
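
The escape formats described above can be sketched in stand-alone C++. The EscapeStyle enum and the escapeCodePoint function are hypothetical names introduced for this illustration and are not part of the Internationalization Module. The sketch covers four of the styles: the native and Java styles escape each UTF-16 code unit separately, while the two XML styles escape the whole code point as one value without zero padding.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical names for illustration; not the library's API.
enum class EscapeStyle { NativeHex, JavaHex, XmlDecimal, XmlHex };

// Formats one Unicode code point in the given escape style.
std::string escapeCodePoint(char32_t cp, EscapeStyle style)
{
  assert(cp <= 0x10FFFF);
  char buf[32];
  const unsigned long v = static_cast<unsigned long>(cp);
  switch (style) {
  case EscapeStyle::XmlDecimal:
    std::snprintf(buf, sizeof buf, "&#%lu;", v);
    return buf;
  case EscapeStyle::XmlHex:
    std::snprintf(buf, sizeof buf, "&#x%lX;", v);
    return buf;
  default:
    break;
  }
  // Native and Java styles write one escape per UTF-16 code unit.
  const char* prefix = (style == EscapeStyle::NativeHex) ? "%U" : "\\u";
  if (cp < 0x10000) {
    std::snprintf(buf, sizeof buf, "%s%04lX", prefix, v);
    return buf;
  }
  // Split a supplementary code point into its surrogate pair.
  const unsigned long off = v - 0x10000;
  const unsigned long hi = 0xD800 + (off >> 10);
  const unsigned long lo = 0xDC00 + (off & 0x3FF);
  std::snprintf(buf, sizeof buf, "%s%04lX%s%04lX", prefix, hi, prefix, lo);
  return buf;
}
```

For U+23456, this sketch yields \uD84D\uDC56 in the Java style and &#144470; in the XML decimal style, matching the surrogate pair arithmetic used in the descriptions above.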

Public Constructors

RWUFromUnicodeConverter(const char* encoding);

Constructs an RWUFromUnicodeConverter for the character encoding scheme given by encoding, the ASCII name or alias of a character encoding scheme (see RWUAvailableEncodingList and RWUEncodingAliasList).

Throws RWUException to indicate that the converter could not be constructed. The exception carries one of the following status codes:

RWUMemoryAllocationError

Indicates that the memory required by the converter could not be allocated.

RWUFileAccessError

Indicates that the requested converter could not be found or opened.

RWUFromUnicodeConverter(const RWUConverterBase& original);

Constructs a converter that is a deep copy of another converter. The new converter uses the same character encoding scheme as the original converter, and possesses the same internal state as the original converter.

Exercise care when copying converters, especially those used for stateful or multibyte encodings. The new converter may be initialized in a state that causes the converter to produce errors if used to convert a new chunk of text. Consider using reset() to restore the converter to a known default state before use.

Throws RWUException to indicate that the copy could not be completed because memory could not be allocated for the underlying implementation object.

Public Destructor

~RWUFromUnicodeConverter();

Destructor.

Public Member Operators

RWUFromUnicodeConverter&
operator=(const RWUConverterBase& rhs);

Assignment operator. Makes self a deep copy of rhs. Self uses the same character encoding scheme as rhs, and possesses the same internal state as rhs.

Throws RWUException to indicate that the copy could not be completed because memory could not be allocated for the underlying implementation object.

Public Member Functions

void
convert(const RWUString& source, RWCString& target,
        bool flush = true);

Converts the sequence of UTF-16 code units contained in the given RWUString into the sequence of bytes required to represent the source in the target character encoding scheme and appends that sequence of bytes to the target RWCString.

The boolean value flush specifies whether self should be flushed to ensure that any code units stored in the converter's internal state are written to target. The effect of this argument depends on the underlying conversion routine supplied by ICU. With ICU versions prior to 2.8, passing false for flush explicitly preserved the internal state. With ICU 2.8 and later, only a value of true has an effect: it explicitly forces a flush, and true is the default. In addition, a flush occurs implicitly if the latest conversion request does not fill the internal conversion buffer.