3.4 Representing Strings

In the Internationalization Module, RWUString provides a container for text encoded using the UTF-16 character encoding form of the Unicode character set. RWUString derives from RWBasicUString in the Essential Tools Module of SourcePro Core, which manages a basic array of RWUChar16 values.

3.4.1 RWBasicUString and RWCString

RWBasicUString is similar to RWCString. For example:

Both classes have methods append(), prepend(), insert(), remove(), and replace() for modifying a string.

Both classes also have methods first(), last(), index(), rindex(), and contains() that search for characters or strings of characters contained within a string.

Both classes have methods compareTo() for lexically ordering strings.

RWBasicUString differs from RWCString in that an RWBasicUString instance contains a series of Unicode characters encoded in UTF-16, while an RWCString instance contains bytes encoded in an arbitrary encoding. RWBasicUString also performs conversion between UTF-16 and UTF-8. Because RWBasicUString contains UTF-16, its API has some methods that RWCString does not. For example:

Methods requiresSurrogatePair(), isHighSurrogate(), and isLowSurrogate() indicate whether a 21-bit Unicode code point requires a surrogate pair of UTF-16 code units. Most characters can be represented in the UTF-16 encoding form with a single 16-bit code unit. Only characters in the range 0x10000 to 0x10FFFF must be represented with a surrogate pair of two UTF-16 code units.

Method computeCodePointValue() returns the appropriate RWUChar32 code point given a surrogate pair of RWUChar16 code units.

Methods highSurrogate() and lowSurrogate() return the first and second surrogate RWUChar16 code units for a given RWUChar32 code point.

Methods compareCodeUnits() and compareCodePoints() perform code unit and code point ordering of strings, respectively. Code unit ordering of two strings may differ from code point ordering if either string contains surrogate pairs.

Methods codeUnitLength() and codePointLength() return the number of code units or code points in a string. The standard length() method is equivalent to codeUnitLength().

Method toUtf8() returns an RWCString containing a UTF-8 representation of the string.

Method toUtf32() returns a std::basic_string templatized on RWUChar32 containing a UTF-32 representation of the string.

Method toWide() returns an RWWString containing a UTF-16 or UTF-32 representation of the contents of the string. The representation depends on the size of wchar_t. If sizeof(wchar_t) is 2, the RWWString is encoded in UTF-16. If sizeof(wchar_t) is 4, the RWWString is encoded in UTF-32.

Method validateCodePoint() throws an RWConversionErr if a given RWUChar32 code point is not a valid Unicode character, or returns the code point if it is valid. This method can be used to validate a code point value anywhere one is passed to a method.
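The surrogate arithmetic behind the methods listed above can be sketched without the library. The following is a minimal illustration, assuming plain standard C++ types in place of RWUChar16 and RWUChar32. The function names needsPair(), leadUnit(), trailUnit(), and combine() are hypothetical stand-ins chosen for this sketch, not the RWBasicUString members themselves.

```cpp
#include <cstdint>

// Hypothetical free functions that mirror the UTF-16 surrogate arithmetic.
// They are NOT part of RWBasicUString; they only illustrate the encoding rules.

// True if the code point lies in the range 0x10000 to 0x10FFFF and
// therefore needs two UTF-16 code units.
constexpr bool needsPair(std::uint32_t cp) {
    return cp >= 0x10000 && cp <= 0x10FFFF;
}

// True if the code unit is a high (first) surrogate.
constexpr bool isLead(std::uint16_t u) { return u >= 0xD800 && u <= 0xDBFF; }

// True if the code unit is a low (second) surrogate.
constexpr bool isTrail(std::uint16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

// First code unit of the pair for a code point that needs one.
constexpr std::uint16_t leadUnit(std::uint32_t cp) {
    return static_cast<std::uint16_t>(0xD800 + ((cp - 0x10000) >> 10));
}

// Second code unit of the pair for a code point that needs one.
constexpr std::uint16_t trailUnit(std::uint32_t cp) {
    return static_cast<std::uint16_t>(0xDC00 + ((cp - 0x10000) & 0x3FF));
}

// Recombines a surrogate pair into the code point it encodes.
constexpr std::uint32_t combine(std::uint16_t lead, std::uint16_t trail) {
    return 0x10000
         + ((static_cast<std::uint32_t>(lead) - 0xD800) << 10)
         + (static_cast<std::uint32_t>(trail) - 0xDC00);
}
```

For example, the code point 0x1F600 splits into the code units 0xD83D and 0xDE00, and combining those two units yields 0x1F600 again.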

3.4.2 Memory Management in RWBasicUString

In typical usage, an RWBasicUString instance owns and manages the memory required to hold an array of RWUChar16 values. Like RWCString, RWBasicUString normally copies input data to an internal buffer. This usage is both safe and convenient.

In some cases, however, such as constant strings or large strings, it may be more efficient to avoid this initial copy by having RWBasicUString use an externally-supplied buffer. Therefore, RWBasicUString can also be constructed with two alternate memory management strategies:

An RWBasicUString instance can reference an external buffer in a read-only fashion. In this case, a client supplies the constructor with a Duration value of Persistent. Any attempt to modify the string causes RWBasicUString to first copy the buffer's contents to an internal buffer, leaving the external buffer untouched. This strategy is primarily used to treat static arrays, or arrays of some other long storage duration, as RWBasicUString instances. For example:

// At file scope
static RWUChar16 acronym[] = { 0x0052, 0x0057, 0x0000 };

RWBasicUString
getAcronymAsUString()
{
    return RWBasicUString(acronym, RWBasicUString::Persistent);
}

An RWBasicUString instance can assume ownership of an external buffer, and use it in a read-write fashion. To pass ownership of a buffer to an RWBasicUString, a client supplies the RWBasicUString constructor with an RWBasicUString::Deallocator object that can be used to deallocate the buffer. (See Section 3.4.2.1.) This strategy is reminiscent of that offered by std::auto_ptr<T>, except that RWBasicUString implements copy construction and assignment via reference counting. An RWBasicUString mutator modifies the external buffer directly if its capacity is large enough. Otherwise, the mutator copies the buffer's contents to an internal buffer, deallocates the external buffer, then modifies the internal buffer.

Note that in both cases, although the client’s choice of constructor determines the initial memory management strategy, RWBasicUString will abandon an externally-supplied buffer in favor of an internal buffer as necessary.
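The read-only strategy, in which an external buffer is abandoned on first modification, can be pictured with a small standalone class. This is a sketch of the general copy-on-write idea only. BorrowingString and its members are hypothetical names invented for illustration, and the real RWBasicUString implementation may differ in every detail.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical illustration of "reference an external buffer read-only,
// copy to an internal buffer on first modification".
class BorrowingString {
public:
    // Reference a long-lived external buffer without copying it.
    BorrowingString(const char16_t* external, std::size_t length)
        : external_(external), length_(length) {
        assert(external != nullptr);
    }

    // True while the external buffer is still being referenced.
    bool isBorrowing() const { return external_ != nullptr; }

    std::size_t length() const {
        return isBorrowing() ? length_ : owned_.size();
    }

    const char16_t* data() const {
        return isBorrowing() ? external_ : owned_.data();
    }

    // A mutator: abandons the external buffer before modifying anything.
    void append(char16_t unit) {
        detach();
        owned_.push_back(unit);
    }

private:
    // Copy the external contents into the internal buffer, once.
    void detach() {
        if (isBorrowing()) {
            owned_.assign(external_, length_);
            external_ = nullptr;
        }
    }

    const char16_t* external_;  // null once the string owns its data
    std::size_t length_;        // length of the external buffer
    std::u16string owned_;      // internal buffer, used after detach()
};
```

In this sketch a read such as data() returns the caller's own array until append() is called. After that point the external array is no longer referenced and is never written to.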

3.4.2.1 Creating and Using Deallocators

Passing ownership of a buffer to an RWBasicUString involves supplying the RWBasicUString with an RWBasicUString::Deallocator object. RWBasicUString::Deallocator is an abstract base class that cannot be instantiated directly. Instead, a deallocator can be created in one of two ways:

Create an instance of RWBasicUString::StaticDeallocator, which derives from RWBasicUString::Deallocator.

An RWBasicUString::StaticDeallocator object wraps a pointer to a class static method or a global function. As a convenience, RWBasicUString supplies three such functions: USE_DELETE(), USE_FREE(), and USE_NONE(). For example, the following code creates an RWBasicUString::StaticDeallocator that invokes delete[] to deallocate string buffers. These buffers are returned from a third-party library that allocates buffers via new[]:

// Create a deallocator. It will be re-used by multiple
// RWBasicUString instances.
RWBasicUString::StaticDeallocator
deallocator(RWBasicUString::USE_DELETE);

// Return RWBasicUStrings that reference externally-supplied
// buffers.
RWBasicUString
getStringFromOutsideSource()
{
    RWUChar16 *array = callToOutsideSource();
    return RWBasicUString(array, &deallocator);
}

Create an instance of a custom RWBasicUString::Deallocator subclass.

The subclass can deallocate string buffers in the manner of its choice, to match the manner in which the buffers are allocated.

RWBasicUString::StaticDeallocator lets the client choose delete[], free(), or a custom memory-management mechanism. An externally supplied deallocation method can also satisfy the heap management requirements of Microsoft Windows dynamic-link libraries, which in some situations create their own heap in addition to that of the calling process.
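The two ways of obtaining a deallocator follow a common pattern: an abstract interface, a ready-made subclass that wraps a function pointer, and user-written subclasses for anything else. The sketch below shows that pattern in isolation. Every name in it (BufferDeallocator, FunctionDeallocator, CountingDeallocator, AdoptedBuffer, and the deallocate() signature) is an assumption made for illustration. Consult the SourcePro C++ API Reference Guide for the actual RWBasicUString::Deallocator interface.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical abstract interface: knows how to release one buffer.
struct BufferDeallocator {
    virtual ~BufferDeallocator() {}
    virtual void deallocate(char16_t* buffer) const = 0;
};

// Ready-made subclass that wraps a plain function pointer.
class FunctionDeallocator : public BufferDeallocator {
public:
    typedef void (*Function)(char16_t*);
    explicit FunctionDeallocator(Function f) : f_(f) { assert(f != nullptr); }
    void deallocate(char16_t* buffer) const override { f_(buffer); }
private:
    Function f_;
};

// Convenience functions matching the three common cases.
inline void releaseWithDeleteArray(char16_t* p) { delete[] p; }
inline void releaseWithFree(char16_t* p) { std::free(p); }
inline void releaseNothing(char16_t*) {}

// Custom subclass: releases with delete[] and counts how often it ran.
class CountingDeallocator : public BufferDeallocator {
public:
    void deallocate(char16_t* buffer) const override {
        delete[] buffer;
        ++count_;
    }
    int count() const { return count_; }
private:
    mutable int count_ = 0;
};

// Owner that adopts a buffer and releases it through the deallocator.
class AdoptedBuffer {
public:
    AdoptedBuffer(char16_t* buffer, const BufferDeallocator* deallocator)
        : buffer_(buffer), deallocator_(deallocator) {
        assert(deallocator != nullptr);
    }
    ~AdoptedBuffer() { deallocator_->deallocate(buffer_); }
    AdoptedBuffer(const AdoptedBuffer&) = delete;
    AdoptedBuffer& operator=(const AdoptedBuffer&) = delete;
    const char16_t* data() const { return buffer_; }
private:
    char16_t* buffer_;
    const BufferDeallocator* deallocator_;
};
```

The key design point is that the deallocator must match the allocator: a buffer obtained from new[] is paired with the delete[] function, one obtained from malloc() with the free() function, and a static array with the function that does nothing.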

3.4.2.2 Null Termination

Given sufficient capacity, RWBasicUString adds a null terminator to any non-static array passed to it. This terminating null is not considered part of the contents, and is not included in the count returned by length().

3.4.3 RWUString and RWBasicUString

RWUString extends RWBasicUString in the Essential Tools Module of SourcePro Core. RWUString is used throughout the API in the Internationalization Module to support locale-sensitive string searching and sorting, string normalization, and conversions between UTF-16 and hundreds of other encodings. All of this functionality is added to that of RWBasicUString. For example, the comparison methods on RWBasicUString simply compare the numerical values of the individual code units or code points, but RWUString can be used in conjunction with RWUCollator in the Internationalization Module to perform locale-sensitive collation (Chapter 6). RWUString also has access to the comprehensive set of properties for each Unicode code point in the Unicode Character Database, so methods such as toUpper() and toLower() behave in a locale-sensitive manner.

3.4.4 Creating an RWUString

RWUString instances can be constructed from:

a null-terminated sequence of char, RWUChar16, or RWUChar32 values

a sequence of char, RWUChar16, or RWUChar32 values of a specified length that may contain embedded nulls

a std::string, RWCString, or RWBasicUString

an RWCSubString, RWCConstSubString, RWUSubString, or RWUConstSubString

RWUString inherits the memory management options of RWBasicUString. See Section 3.4.2 for more information.

3.4.5 Converting to Unicode

When an RWUString is constructed from a non-Unicode character or string, the non-Unicode character or string is converted into Unicode. Some RWUString constructors accept RWUToUnicodeConverter arguments to specify explicitly how to convert from a non-Unicode string to a Unicode string. For example:

RWUToUnicodeConverter fromAscii("US-ASCII");
RWUString str("Hello World", fromAscii);

The constructor converts the string literal from US-ASCII to Unicode; the new RWUString str contains the Unicode representation of Hello World.

Non-Unicode strings can also be converted implicitly by specifying a default conversion context. For example:

RWUToUnicodeConversionContext fromAsciiContext("US-ASCII");
RWUString str = "hello";

See Chapter 4 for more information on conversion.

3.4.6 Converting from Unicode

RWUString provides the toBytes() method that accepts an RWUFromUnicodeConverter instance, and returns an RWCString containing the byte sequence produced when the contents of the RWUString are converted into the specified encoding. For example, assuming source is an RWUString:

RWUFromUnicodeConverter toShiftJis("Shift-JIS");
RWCString target = source.toBytes(toShiftJis);

The new RWCString target contains bytes representing characters encoded in Shift-JIS.

RWUString instances can also be converted implicitly by specifying a default conversion context and calling toBytes() with no arguments. For example:

RWUFromUnicodeConversionContext toShiftJisContext("Shift-JIS");
RWCString target = source.toBytes();

The stream insertion operator for RWUString also performs conversions. It writes the sequence of bytes that are produced when the contents of a string are converted into the encoding specified by the currently active RWUFromUnicodeConversionContext. For example, assuming str is an RWUString:

RWUFromUnicodeConversionContext toShiftJisContext("Shift-JIS");
std::cout << str << std::endl;

See Chapter 4 for more information on conversion.
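To make concrete what "the byte sequence produced when the contents are converted" means, the following library-independent sketch converts UTF-16 code units to UTF-8 bytes by hand. UTF-8 is used here only because its rules are simple enough to show in full; the library's converters handle many more encodings. The names appendUtf8() and utf16ToUtf8() are hypothetical, and the choice to substitute U+FFFD for an unpaired surrogate is an assumption of this sketch, not documented library behavior.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Appends the UTF-8 encoding of one code point (1 to 4 bytes).
inline void appendUtf8(std::string& out, std::uint32_t cp) {
    assert(cp <= 0x10FFFF);
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Converts a UTF-16 string to UTF-8, joining surrogate pairs first.
inline std::string utf16ToUtf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = in[i];
        const bool lead = cp >= 0xD800 && cp <= 0xDBFF;
        const bool trailNext = i + 1 < in.size()
                            && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF;
        if (lead && trailNext) {
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;  // the trail unit has been consumed
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD;  // assumption: replace an unpaired surrogate
        }
        appendUtf8(out, cp);
    }
    return out;
}
```

The result has a different length from the input in general: the six code units of "cliché" become seven bytes, because U+00E9 occupies two bytes in UTF-8.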

3.4.7 Escape Sequences

RWUString provides the unescape() method, which replaces character escape sequences with their corresponding Unicode characters. The recognized escape sequences are shown in Table 1. The value of any other escape sequence is the value of the character that follows the backslash.

Table 1 – Recognized Escape Sequences
Escape Sequence	Unicode
\uhhhh	4 hexadecimal digits in the range [0-9A-Fa-f]
\Uhhhhhhhh	8 hexadecimal digits
\xhh	1 or 2 hexadecimal digits
\ooo	1, 2, or 3 octal digits in the range [0-7]
\a	U+0007: alert (BEL)
\b	U+0008: backspace (BS)
\t	U+0009: horizontal tab (HT)
\n	U+000A: newline/line feed (LF)
\v	U+000B: vertical tab (VT)
\f	U+000C: form feed (FF)
\r	U+000D: carriage return (CR)
\"	U+0022: double quote
\'	U+0027: single quote
\?	U+003F: question mark
\\	U+005C: backslash

Note that when you create an RWUString from a string literal containing an escaped character, you must use a double-backslash sequence to escape characters, as the C++ compiler itself treats the \ character as special, denoting the beginning of an escape sequence embedded in the C++ source code. For example:

RWUToUnicodeConverter fromAscii("US-ASCII");
RWUString str("clich\\u00e9", fromAscii);
RWUFromUnicodeConverter toAscii("US-ASCII");
std::cout << str.toBytes(toAscii) << std::endl;
std::cout << str.unescape().toBytes(toAscii) << std::endl;

Results:
========
clich\u00e9
cliché

If an escape sequence is ill-formed, unescape() throws an RWConversionErr. (See this class entry in the SourcePro C++ API Reference Guide.)
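The unescaping rules can be illustrated with a reduced, standalone version that works on std::u16string. This sketch handles only the \uhhhh form and the single-character escapes from Table 1, and it applies the "any other escape" rule for the rest. It deliberately rejects \x, \U, and octal escapes instead of implementing them, and it throws std::invalid_argument where the library would throw RWConversionErr. The names hexValue(), simpleEscape(), and unescapeSketch() are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Returns 0 to 15 for a hexadecimal digit, or -1 for anything else.
inline int hexValue(char16_t c) {
    if (c >= u'0' && c <= u'9') return c - u'0';
    if (c >= u'a' && c <= u'f') return c - u'a' + 10;
    if (c >= u'A' && c <= u'F') return c - u'A' + 10;
    return -1;
}

// Maps the letter after a backslash to its control character; any other
// character stands for itself (covers \\, \", \' and \?).
inline char16_t simpleEscape(char16_t c) {
    switch (c) {
    case u'a': return 0x0007;
    case u'b': return 0x0008;
    case u't': return 0x0009;
    case u'n': return 0x000A;
    case u'v': return 0x000B;
    case u'f': return 0x000C;
    case u'r': return 0x000D;
    default:   return c;
    }
}

inline std::u16string unescapeSketch(const std::u16string& in) {
    std::u16string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] != u'\\') { out.push_back(in[i]); continue; }
        if (++i == in.size()) throw std::invalid_argument("dangling backslash");
        const char16_t c = in[i];
        if (c == u'u') {
            // Exactly four hexadecimal digits must follow.
            if (in.size() - i <= 4) throw std::invalid_argument("short \\u escape");
            char16_t unit = 0;
            for (int k = 1; k <= 4; ++k) {
                const int v = hexValue(in[i + k]);
                if (v < 0) throw std::invalid_argument("bad hexadecimal digit");
                unit = static_cast<char16_t>(unit * 16 + v);
            }
            out.push_back(unit);
            i += 4;
        } else if (c == u'x' || c == u'U' || (c >= u'0' && c <= u'7')) {
            throw std::invalid_argument("escape form not handled by this sketch");
        } else {
            out.push_back(simpleEscape(c));
        }
    }
    assert(out.size() <= in.size());  // unescaping never lengthens the text
    return out;
}
```

As in the RWUString example above, a C++ literal must double the backslash, so the source text u"clich\\u00e9" holds eleven code units and unescapes to the six code units of "cliché".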

3.4.8 String Length

The characteristics of UTF-16 imply that the number of 16-bit code units in an RWUString may differ from the number of code points. Furthermore, the nature of Unicode implies that the number of code points may differ from the number of characters, as interpreted by the end user. Several methods are provided to determine the length of a string:

The inherited length() and codeUnitLength() methods return the number of UTF-16 code units in an RWUString.

The inherited codePointLength() method returns the number of code points in an RWUString.

Note that codePointLength() may be slower than length() or codeUnitLength() because codePointLength() must traverse the string to find code points that arise from surrogate code unit pairs. Since the majority of code points in the current Unicode Standard do not require a surrogate representation, many applications can rely on length() or codeUnitLength() to determine or estimate the number of code points.

An RWUBreakSearch can also be used to iterate over the characters of an RWUString, in the context of a particular locale. (See Chapter 7.)
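The difference between the two counts, and the reason the code point count requires a traversal, can be shown with a short standalone function. This is a sketch of the counting rule only, written against char16_t arrays; countCodePoints() is a hypothetical name, and the function assumes a compiler that supports C++14 constexpr.

```cpp
#include <cstddef>

// Counts code points in `n` UTF-16 code units. A high surrogate that is
// immediately followed by a low surrogate counts once; every other code
// unit (including an unpaired surrogate) counts as one code point.
constexpr std::size_t countCodePoints(const char16_t* units, std::size_t n) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < n; ++i) {
        const bool lead = units[i] >= 0xD800 && units[i] <= 0xDBFF;
        const bool trailNext = i + 1 < n
                            && units[i + 1] >= 0xDC00 && units[i + 1] <= 0xDFFF;
        if (lead && trailNext) {
            ++i;  // skip the trail unit of the pair
        }
        ++count;
    }
    return count;
}
```

A string holding "a", the code point U+1F600, and "b" occupies four code units but contains three code points. For text with no surrogate pairs the two counts are equal, which is why the code unit count is often an acceptable estimate.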

3.4.9 Comparing Strings

RWUString performs comparisons on a lexical basis. Methods such as compareTo(), contains(), first(), last(), index(), rindex(), strip(), and the global comparison operators compare the bit values of individual code units, not the logical values of code points or characters. In contrast, RWUCollator performs comparisons on a logical basis, following the conventions specified in a given locale. The logical comparisons made by RWUCollator are more likely to match an end user's expectations regarding string equality and ordering. The lexical comparisons made by RWUString, however, are likely to be faster. If two strings contain characters from the same script, and are in the same normalization form, lexical comparisons may be adequate for many purposes. See Chapter 6 for more information on RWUCollator and locale-sensitive collation.
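Comparing the bit values of code units gives a different order from comparing code points only when surrogate pairs are involved, because surrogate code units (0xD800 to 0xDFFF) sit numerically below the code units 0xE000 to 0xFFFF. The sketch below contrasts the two orderings on raw char16_t arrays. The function names are hypothetical, the code assumes well-formed UTF-16 and a C++14 compiler, and the remapping trick in pointRank() is one common way to obtain code point order without decoding; it is not claimed to be how the library does it.

```cpp
#include <cstddef>
#include <cstdint>

// Rank used for code point order: moves surrogates above 0xE000..0xFFFF
// so that supplementary characters sort after all BMP characters.
constexpr std::uint32_t pointRank(char16_t u) {
    return u < 0xD800 ? static_cast<std::uint32_t>(u)
         : u >= 0xE000 ? static_cast<std::uint32_t>(u) - 0x800u
                       : static_cast<std::uint32_t>(u) + 0x2000u;
}

// Compares by raw code unit value. Returns -1, 0, or 1.
constexpr int compareCodeUnitOrder(const char16_t* a, std::size_t na,
                                   const char16_t* b, std::size_t nb) {
    const std::size_t n = na < nb ? na : nb;
    for (std::size_t i = 0; i < n; ++i) {
        if (a[i] != b[i]) return a[i] < b[i] ? -1 : 1;
    }
    return na == nb ? 0 : (na < nb ? -1 : 1);
}

// Compares by code point value, using the remapped rank of the first
// code unit that differs. Returns -1, 0, or 1.
constexpr int compareCodePointOrder(const char16_t* a, std::size_t na,
                                    const char16_t* b, std::size_t nb) {
    const std::size_t n = na < nb ? na : nb;
    for (std::size_t i = 0; i < n; ++i) {
        if (a[i] != b[i]) return pointRank(a[i]) < pointRank(b[i]) ? -1 : 1;
    }
    return na == nb ? 0 : (na < nb ? -1 : 1);
}
```

For instance, U+10000 is stored as the code units 0xD800 0xDC00, while U+FF5E is the single code unit 0xFF5E. In code unit order U+10000 sorts first, because 0xD800 is less than 0xFF5E; in code point order U+FF5E sorts first. Neither ordering reflects locale conventions, which is the job of RWUCollator.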

3.4.10 Accessing SubStrings

In the Internationalization Module, RWUSubString and RWUConstSubString provide access to a range of characters within a referenced RWUString:

RWUSubString allows read-write access to a range of code units within a referenced RWUString.

RWUConstSubString allows read-only access to a range of code units within a referenced RWUString.

The range within a referenced RWUString is defined by a starting position and an extent. For example, the 7th through the 11th elements, inclusive, would have a starting position of 6 and an extent of 5.

There are no public constructors for substrings. Substrings are constructed by various functions of the RWUString class. Typically, substrings are created and used anonymously, then destroyed immediately. For example:

RWUString str("Hello World");
str(6, 5) = "Mom";
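The starting-position-and-extent convention is the same one used by the standard library, so the effect of the assignment above can be mimicked with std::u16string. This is only an analogy for the indexing arithmetic; replaceRange() is a hypothetical helper, and it returns a modified copy instead of writing through a substring object as RWUSubString does.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Replaces the `extent` code units starting at `start` with `replacement`
// and returns the result. The range must lie within the string.
inline std::u16string replaceRange(std::u16string text,
                                   std::size_t start,
                                   std::size_t extent,
                                   const std::u16string& replacement) {
    assert(start <= text.size() && extent <= text.size() - start);
    text.replace(start, extent, replacement);
    return text;
}
```

With the text "Hello World", a starting position of 6 and an extent of 5 select the code units of "World" (the 7th through the 11th elements), so replacing that range with "Mom" gives "Hello Mom". The replacement does not need to have the same length as the range it replaces.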