Character Sets

DB Interface Module User's Guide

13.2 Character Sets

Most relational databases were developed in environments where the primary language was English. In these environments, database servers stored character data in some variant of the CHAR or VARCHAR data types. As database vendors expanded beyond the English-speaking markets, demand increased for different native character sets. In response, the NCHAR and NVARCHAR data types were created for holding character data in national character sets.

In this chapter, we use the terms:

standard character set data types, to mean the original CHAR or VARCHAR data types;

national character set data types, to mean the newer NCHAR and NVARCHAR data types.

Unfortunately, database vendors did not standardize on a common set of features and capabilities for these new data types. Some databases implement national character set support in their standard character data types and use NCHAR and NVARCHAR as synonyms. Other vendors implement the data types identically except for the collation sequencing capabilities. Still others use completely separate implementations for standard and national character set data types. The documentation provided by your database vendor should help you identify the vendor's implementation technique.

The DB Interface Module is designed to make the differences between database implementations nearly invisible, but some differences do persist. Please consult the internationalization section of your DB Access Module Guide to learn about the behavior differences.

In the examples in this chapter, we use Chinese characters that represent something similar to "Hello, world." These characters were selected from the Unicode standard. In order for these examples to run properly, the machine must have the appropriate locales and environment installed. Please note that these examples are intended to show how to use the various string classes available to SourcePro DB, rather than how to write portable, correct Unicode applications.

13.2.1 National Character Sets and C++ Data Types

SourcePro DB uses four different C++ classes to hold character string data:

RWCString from the Essential Tools Module is used for standard ASCII strings.

RWWString from the Essential Tools Module is used for wide character strings, such as UCS-2 or UCS-4.

RWDBMBString from the DB Interface Module is used for multibyte character strings, such as UTF-8.

RWBasicUString from the Essential Tools Module is used to encapsulate UTF-16 characters and strings.

Although class RWCString is capable of storing multibyte character strings, you are encouraged to use RWDBMBString for multibyte strings in SourcePro DB. Because some databases differentiate between multibyte and standard ASCII strings, applications using RWDBMBString for multibyte character strings maximize portability to other databases.

For Unicode applications, however, you are encouraged to use class RWBasicUString or RWUString instead of RWDBMBString or RWWString. In SourcePro DB, all the different database vendor Unicode types are mapped to RWBasicUString. RWBasicUString is platform independent (its code units are always 2 bytes) and contains methods implemented specifically for manipulating UTF-16 data. Also, because RWBasicUString is the base class of RWUString, applications built with RWBasicUString can integrate seamlessly with the SourcePro C++ Internationalization Module.

You are encouraged to use RWDBMBString rather than RWCString for storing multibyte character strings, and to use RWBasicUString rather than RWWString or RWDBMBString for handling UTF-16 data.

The actual character sets used by a given system depend on several aspects of the hardware and software installation. When an operating system is installed on a machine, a character set is selected to represent the keyboard attached to the machine as well as some possible supplementary character sets. A database also has at least one character set associated with the server and one with the client.

It is important to ensure compatibility between the default character set of the operating system and the character set of the client database software. The DB Interface Module does not implement translations between character sets, but it may forward a translation request to the underlying operating system for translations between wide and multibyte strings. If there is an incompatibility between the operating system's multibyte character set and the multibyte character set expected by a database's client software, there will be problems. UTF-16 data does not undergo any translations and is sent directly to the database client.

From the standpoint of SourcePro DB, the character set on the database server is irrelevant. It is the responsibility of the database software to translate between the server and client character sets. It is the responsibility of the system administrator to ensure that this mapping of character sets is working properly.

Incompatibility between the multibyte character set used by the operating system and the multibyte character set expected by database client software causes problems. It is the responsibility of your system administrator to ensure compatibility.

In the following sections, we discuss in more detail the four different C++ classes used by SourcePro DB to hold character string data.

13.2.2 RWBasicUString

Class RWBasicUString from the Essential Tools Module is used for UTF-16 Unicode strings in SourcePro DB. For more details on RWBasicUString, please see the SourcePro C++ API Reference Guide.

Class RWBasicUString can be used to fetch or send UTF-16 data to your Unicode database. Please see your Access Module Guide for information about the Unicode capabilities of your Access Module. The Access Module Guide also provides mapping information that describes the SQL type that RWBasicUString maps to for the database you use. SourcePro DB can use class RWBasicUString just as it would use class RWCString or RWWString for sending and retrieving data.

The following example demonstrates the use of RWBasicUString with RWDBInserter, RWDBSelector, and RWDBReader. Assume that the table, t1, consists of one column of type NVARCHAR:

void
executeUsingUnicodeStrings(RWDBDatabase& aDB)
{
  RWBasicUString bustring("\346\202\250\345\245\275");

  RWDBTable t1 = aDB.table("t1");
  RWDBInserter ins = t1.inserter();
  ins << bustring;
  ins.execute();

  RWDBSelector sel = aDB.selector();
  sel << t1;
  sel.where(t1["a"] == bustring);

  RWDBReader rdr = sel.reader();
  RWBasicUString outBUString;
  while(rdr()) {
    rdr >> outBUString;
  }
}

You can use RWBasicUString like any other data type in SourcePro DB.
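The octal escapes in the string literal above are UTF-8 bytes. As a rough illustration of what they encode, the following standalone sketch decodes UTF-8 into UTF-16 code units using only the C++ standard library. The function name utf8ToUtf16 is hypothetical and is not part of SourcePro DB; how RWBasicUString itself performs any conversion is not shown here.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper, not a SourcePro API: decode UTF-8 bytes into
// UTF-16 code units. Sketch only: it rejects bad lead bytes, bad
// continuation bytes, and truncated input, but does not check for
// overlong encodings or encoded surrogate code points.
std::u16string utf8ToUtf16(const std::string& bytes)
{
    std::u16string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // continuation bytes that must follow
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (i + extra >= bytes.size())
            throw std::invalid_argument("truncated UTF-8 sequence");

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        i += extra + 1;

        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Code points above the BMP need a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

For the six bytes used in the example, this yields the two code units U+60A8 and U+597D.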

RWUString from the Internationalization Module can also be used, since it is derived from the Essential Tools data type RWBasicUString. However, if you are creating an RWDBTBuffer instance with data type RWUString, you must include the header file <rw/db/tbuffer_ustr.h> instead of <rw/db/tbuffer.h>.

13.2.3 RWWString

Class RWWString from the Essential Tools Module is used for wide character strings in SourcePro DB. While RWWString can hold values from any wide character set, how those values are interpreted depends heavily on your operating system. If an instance of this class is used for print or screen output, the output device must understand the character set. For more details on RWWString, please see the SourcePro C++ API Reference Guide.

You are encouraged to use RWBasicUString rather than RWWString for handling Unicode data. Platform-dependent definitions of the wide character type will decrease the portability of your applications.
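The portability concern can be seen with standard C++ alone. The sketch below is not SourcePro code, and the helper names wideCharBits and utf16UnitsFor are hypothetical. It reports the width of wchar_t on the compiling platform and the number of UTF-16 code units a given code point needs.

```cpp
#include <climits>
#include <cstddef>
#include <string>

// Hypothetical helper: width in bits of one wchar_t on this platform.
// The value differs between platforms (commonly 16 or 32).
constexpr std::size_t wideCharBits()
{
    return sizeof(wchar_t) * CHAR_BIT;
}

// Hypothetical helper: number of UTF-16 code units needed to encode
// one Unicode code point. Anything above the BMP needs a surrogate pair.
constexpr std::size_t utf16UnitsFor(char32_t codePoint)
{
    return codePoint < 0x10000 ? 1 : 2;
}

// char16_t is guaranteed to hold at least 16 bits everywhere.
static_assert(sizeof(char16_t) * CHAR_BIT >= 16,
              "char16_t must hold a UTF-16 code unit");
```

A UTF-16 string therefore has the same code unit sequence on every platform, whereas the same text held in wchar_t may occupy a different number of bytes from one system to another.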

Note that wide character strings are rarely serialized; multibyte strings are used for serialization. To send a wide character string to a database, it must be translated into multibyte form. The operating system provides calls that translate a wide string into a multibyte string. The DB Interface Module automatically employs the translation system calls to send a wide string to a server. This applies only when serializing literal values and does not apply to bound data.
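As an illustration of the kind of operating system translation described above, the following sketch converts a wide string to a multibyte string with the standard C library function wcstombs. This is an assumption about the general technique only; the source does not say which calls the DB Interface Module actually uses, and the function name wideToMultibyte is hypothetical.

```cpp
#include <cstddef>
#include <cstdlib>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper, not a SourcePro API: translate a wide string to
// the multibyte encoding of the current C locale (LC_CTYPE).
std::string wideToMultibyte(const std::wstring& wide)
{
    // MB_CUR_MAX is the largest number of bytes one character can
    // occupy in the current locale; add one byte for the terminator.
    std::vector<char> buffer(wide.size() * MB_CUR_MAX + 1);

    const std::size_t written =
        std::wcstombs(buffer.data(), wide.c_str(), buffer.size());

    // wcstombs returns (size_t)-1 when a wide character has no
    // representation in the locale's multibyte character set.
    if (written == static_cast<std::size_t>(-1))
        throw std::runtime_error("wide character not representable in current locale");

    return std::string(buffer.data(), written);
}
```

The result depends entirely on the locale in effect when the function is called, which is why the operating system's multibyte character set must match what the database client expects.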

Class RWWString may be used both for sending data to a database client and for receiving the data that is fetched. When using a wide string with an RWDBInserter, the destination column must be able to handle characters from a national character set. Usually this would be NCHAR or NVARCHAR, but for some databases it could be other CHAR or VARCHAR variants. Similarly, when an RWWString is used in an expression with class RWDBExpr, it should be used where a national string makes sense semantically.

The following example demonstrates the use of RWWString with both RWDBInserter and RWDBSelector. Assume that the table, t1, consists of one column of type NVARCHAR.

Applications using RWWString are not portable. If a system defines a wchar_t as 4 bytes and a database client defines an NVARCHAR type as 2 bytes, the database client may reject the wchar_t as invalid input and give a binding or type conversion error.

void
showSQLUsingWideStrings(RWDBDatabase& aDB)
{
  RWWString wstring("\346\202\250\345\245\275",
                    RWWString::multiByte);

  RWDBTable t1 = aDB.table("t1");
  RWDBInserter ins = t1.inserter();
  ins << wstring;
  cout << ins.asString(true) << endl;

  RWDBSelector sel = aDB.selector();
  sel << t1;
  sel.where (t1["a"] == wstring);
  cout << sel.asString(true) << endl;
}

The output of this demonstration routine is two SQL statements. They represent an INSERT statement and a SELECT statement of the form:

INSERT INTO t1 VALUES ('您好')
SELECT * FROM t1 WHERE t1.a = '您好'

In both cases, the actual strings reflect the dialect of SQL understood by the server, which is represented by aDB. When sending a literal value, the quoting technique required by that server for multibyte strings is placed around the multibyte string.

The use of RWWStrings with cursors is a little more restrictive. The originating column must be of a type appropriate for national character strings. The NCHAR and NVARCHAR data types are always acceptable; however, for many databases CHAR and VARCHAR variants also work. The next example demonstrates how to use an RWWString to fetch data from a database with an RWDBCursor and RWDBReader:

void
getWideStrings(RWDBDatabase& aDB)
{
  RWDBTable t1 = aDB.table("t1");

  RWWString wstring;

  {
    cout << "t1 using a cursor" << endl;
    RWDBCursor cur = t1.cursor();
    cur << &wstring;
    while (cur.fetchRow().isValid())
      cout << wstring << endl;
  }

  {
    cout << "t1 using a reader" << endl;
    RWDBReader rdr = t1.reader();
    while (rdr()) {
      rdr >> wstring;
      cout << wstring << endl;
    }
  }
}

If the display device used understands the character sets involved and the table t1 contains the single row of our Chinese hello example, the output of this sample program should look something like this:

t1 using a cursor
您好
t1 using a reader
您好

Finally, RWWString can be used with RWDBStoredProc the same way as with other data types. You should check your Access Module Guide to see if there are any specific restrictions that apply to your database.

13.2.4 RWDBMBString

The main purpose of class RWDBMBString is to assist users of the DB Interface Module in differentiating between plain ASCII strings and multibyte character set strings. For example, some databases may require quotation marks around national character set strings when inserted into or compared with national character set columns (NCHAR, NVARCHAR). If all strings are stored in RWCString instances, the DB Interface Module cannot determine which strings need special quotes and which do not.

Some databases require different treatment for standard character strings and national character strings. For this reason, it is the application programmer's responsibility to treat standard character string columns and national character string columns as different types.

If multibyte strings and national character columns are used in a DB Interface Module application, using RWDBMBString as the implementation of the multibyte strings ensures maximum cross-database portability.
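The value of a separate string type can be sketched with plain C++ overloading. The types PlainText and MultibyteText below are hypothetical stand-ins for RWCString and RWDBMBString, and the N'...' prefix is only an assumption about one possible SQL dialect; the quoting a real Access Module produces may differ.

```cpp
#include <string>

// Hypothetical stand-in for a standard character string type.
struct PlainText {
    std::string bytes;
    explicit PlainText(const std::string& b) : bytes(b) {}
};

// Hypothetical stand-in for a multibyte string type. It derives from
// PlainText, so it offers the same members but is a distinct type.
struct MultibyteText : PlainText {
    explicit MultibyteText(const std::string& b) : PlainText(b) {}
};

// Double any embedded single quotes so the literal stays well formed.
std::string escapeQuotes(const std::string& s)
{
    std::string out;
    for (char c : s) {
        out += c;
        if (c == '\'')
            out += '\'';
    }
    return out;
}

// Standard character literal: plain single quotes.
std::string sqlLiteral(const PlainText& s)
{
    return "'" + escapeQuotes(s.bytes) + "'";
}

// National character literal in the assumed dialect: N'...'.
// Overload resolution picks this version for MultibyteText arguments.
std::string sqlLiteral(const MultibyteText& s)
{
    return "N'" + escapeQuotes(s.bytes) + "'";
}
```

Because the distinction is carried by the static type, the formatting code never has to inspect the bytes to guess which quoting applies. If both kinds of string were stored in the same type, that information would be lost.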

You should use RWDBMBString like any other data type. You should check your Access Module Guide to see if there are any specific restrictions that apply to your database.

RWDBMBString is a direct subclass of RWCString from the Essential Tools Module. All the member functions that are available in RWCString are also available in RWDBMBString. This includes the use of regular expression and substring classes.

13.2.5 RWCString

RWCStrings can be used to hold and manipulate multibyte strings. SourcePro DB allows their use even when associated with national character set columns, although it may not format them properly since RWCString is also the default class associated with standard character string columns. It is recommended that SourcePro DB application programs use RWDBMBStrings for national character set columns and RWCString for standard character set columns.

13.2.6 Data Definition Language

The encapsulation of Data Definition Language (DDL) by the DB Interface Module allows the creation of national character set columns when defining new tables. The public enum ValueType, from class RWDBValue, is used by class RWDBColumn to specify the type of a column. When creating a table, specifying a column type of RWDBValue::UString, RWDBValue::MBString, or RWDBValue::WString results in a national character set column type. Please see your Access Module User Guide for more information on the SQL type to which these RWDBValue types map.
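The idea of a value type selecting the column's SQL type can be sketched as follows. Everything here is hypothetical: ColumnKind is not RWDBValue::ValueType, columnDefinition is not a SourcePro function, and the VARCHAR/NVARCHAR mapping is an assumption for an imaginary dialect. The real mapping is defined by your Access Module.

```cpp
#include <string>

// Hypothetical mirror of a few value types; not the SourcePro enum.
enum class ColumnKind { String, MBString, WString, UString };

// Per the text above, every kind except the standard string kind
// results in a national character set column.
bool isNationalKind(ColumnKind kind)
{
    return kind != ColumnKind::String;
}

// Build one column definition for a CREATE TABLE statement, assuming a
// dialect that spells the two families VARCHAR and NVARCHAR.
std::string columnDefinition(const std::string& name,
                             ColumnKind kind,
                             unsigned length)
{
    const char* sqlType = isNationalKind(kind) ? "NVARCHAR" : "VARCHAR";
    return name + " " + sqlType + "(" + std::to_string(length) + ")";
}
```

For example, columnDefinition("a", ColumnKind::UString, 20) produces "a NVARCHAR(20)" under these assumptions, while ColumnKind::String produces "a VARCHAR(20)".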

13.2.7 Using Schema Information from Result Tables

While some databases require special treatment for the standard and multibyte strings sent to servers, some do not differentiate when returning standard and multibyte strings. Please check your Access Module User's Guide to see what type mappings are defined for the SQL types in your database when retrieving data.
