ObjectStore C++ API User Guide

Chapter 9 Using Asian Language String Encodings

There are many standards for encoding Asian characters. In Japan, for example, five encodings are in broad use: JIS, SJIS, EUC, Unicode, and UTF-8.

Usually an application uses one encoding for all strings to be stored inside a database. The encoding chosen is most often the one used in the operating system of the ObjectStore client.

However, if the application has heterogeneous clients using a variety of encodings, conversion from one encoding to another is necessary at some point. The clients could be traditional ObjectStore client processes or thin-client browsers that emit data in different encodings.

The Class Library: os_str_conv

This class library provides conversion facilities for various Japanese language text encoding methods: EUC, JIS, SJIS, Unicode, and UTF-8.

The library provides a facility to detect the encoding of a given string. This is useful for applications in which a client might send strings in an unknown format, a common problem for Internet applications.

The most common application of this class is conversion between EUC and SJIS, which allows UNIX and Windows applications to share data. JIS is commonly used for email. Applications normally store data in a homogeneous format inside a database, and incoming strings are converted as required before they are persistently allocated. Outgoing strings can also be converted to the client's native encoding. For Web applications this outgoing conversion is usually not necessary, since internationally aware browsers (Netscape 2.0 and above, for example) can automatically detect and convert various incoming formats themselves.

The class library currently consists of a single class, os_str_conv, instantiated once for each conversion path required:

os_str_conv(encode_type dst, encode_type src = AUTOMATIC);

Where

enum encode_type {            /* string encoding type */
      UNKNOWN = 0,            /* conversion or automatic detection failed */
      AUTOMATIC,              /* detect automatically */
      AUTOMATIC_ALLOW_KANA,   /* detect automatically, allow half-width kana */
      ASCII,                  /* ASCII */
      SJIS,                   /* Shift-JIS */
      EUC,                    /* EUC */
      UNICODE,                /* Unicode (cannot be detected automatically) */
      JIS,                    /* JIS */
      UTF8                    /* UTF-8 (cannot be detected automatically) */
      /* add new encoding types here */
};

Here is an example. Given an instance of os_str_conv, such as

os_str_conv *sjis_to_euc = new os_str_conv(os_str_conv::EUC,
      os_str_conv::SJIS);

A conversion can be done on char* sjis_src:

char *euc_dest = new char[sjis_to_euc->get_converted_size(sjis_src)];
sjis_to_euc->convert(euc_dest, sjis_src);

The call to get_converted_size() is not strictly required; it is provided for the convenience of the user to allocate buffers of appropriate size. Because it requires examination of the entire source string, time to complete it is proportional to the source string length.

Automatic Detection of a Source String Encoding

Sometimes, it is not possible for an application to know the encoding of a given source string. os_str_conv provides methods that can analyze a given string and determine its encoding. For example:

os_str_conv *to_euc = new os_str_conv(os_str_conv::EUC,
      os_str_conv::AUTOMATIC);
size_t len = to_euc->get_converted_size(unknown_src);
if (len)
{
      /* detection succeeded; len is the buffer size to allocate */
      char *euc_dest = new char[len];
      to_euc->convert(euc_dest, unknown_src);
}
else
{
      // couldn't convert -- application needs to handle this!
}

Important Note: The autodetector is not guaranteed to work in all cases.

If it fails inside get_converted_size, get_converted_size returns 0 to indicate the failure. Be careful not to allocate strings based on its return value without checking for failure!

Unfortunately, no automatic detection algorithm can correctly distinguish EUC from SJIS in all cases because of overlap in their assignment ranges. Clever algorithms exploit patterns typical of real text. This implementation is reasonably straightforward. The most difficult problem (distinguishing between SJIS half-width kana and EUC) is avoided by asking the user to choose between the two possible interpretations. In nearly all cases, os_str_conv::AUTOMATIC is the appropriate setting.

In practice, the problems of ambiguity are not likely to affect applications, since incoming text is usually all in the single encoding defined by the operating system on which it was generated. Autodetection can therefore be run once at the beginning of a session, and it is reasonable to assume that the encoding will not change for the rest of that session.

As mentioned earlier, there are areas in the EUC and SJIS encodings that overlap, and so a given string might be valid in either encoding. This makes autodetection ambiguous.

There are two ambiguous cases:

The half-width kana of SJIS. It is possible for a string consisting entirely of bytes in this range to be either SJIS or EUC. This is the most troublesome case.
An obscure range of SJIS and EUC that overlaps. The characters represented by this range are rarely used, so it is highly unlikely that a string would consist entirely of such characters.

Detection is handled according to these rules:

The algorithm examines each character of the string in sequence until the encoding is determined. Therefore, a string beginning with an unambiguous substring followed by an ambiguous substring is detected according to the first substring.
Strings consisting entirely of the second ambiguous type are handled as unknown. As mentioned, this case is very unlikely.
All SJIS half-width kana are single-byte encodings. Therefore, a string consisting entirely of an odd number of bytes in the SJIS half-width kana range is considered SJIS.
A string beginning with an even number of bytes in the SJIS half-width kana range is ambiguous until the following characters are examined according to normal detection rules.
A string beginning with an odd number of bytes in the SJIS half-width kana range requires special examination of the last character. If this is an EUC first-byte code, and it is followed by a valid EUC second-byte code, then the string is EUC. However, if the following code is not a valid EUC second-byte (it might be ordinary ASCII), then the final character is interpreted as SJIS half-width kana and the string is interpreted as SJIS.
A string consisting entirely of an even number of bytes in the SJIS half-width kana range is ambiguous. It is quite possible for such a string to appear in real applications. The os_str_conv::AUTOMATIC setting causes the autodetector to interpret this case as SJIS. However, if os_str_conv::AUTOMATIC_ALLOW_KANA is specified, this case is interpreted as unknown. Object Design believes that the SJIS interpretation is correct for most cases.
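The two rules for strings that consist entirely of half-width kana bytes can be sketched as follows. This is only an illustration of the parity rule, not the os_str_conv detector itself: the names kana_guess, is_sjis_halfwidth_kana, and classify_all_kana are hypothetical, and the byte range 0xA1 through 0xDF is the standard SJIS half-width kana range rather than a value taken from this guide.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical result type for this sketch; it is not os_str_conv::encode_type.
enum class kana_guess { sjis, unknown, not_all_kana };

// True if b lies in the SJIS half-width kana range, 0xA1 through 0xDF.
inline bool is_sjis_halfwidth_kana(unsigned char b) {
    return b >= 0xA1 && b <= 0xDF;
}

// Applies only the two "consisting entirely of" rules from the list above.
// allow_kana stands for the AUTOMATIC_ALLOW_KANA setting.
kana_guess classify_all_kana(const unsigned char* s, std::size_t n,
                             bool allow_kana) {
    assert(s != nullptr || n == 0);
    if (n == 0) return kana_guess::not_all_kana;
    for (std::size_t i = 0; i < n; ++i) {
        if (!is_sjis_halfwidth_kana(s[i])) return kana_guess::not_all_kana;
    }
    if (n % 2 == 1) return kana_guess::sjis;  // odd count: treated as SJIS
    // Even count is ambiguous: AUTOMATIC picks SJIS, ALLOW_KANA reports unknown.
    return allow_kana ? kana_guess::unknown : kana_guess::sjis;
}
```

Strings that mix kana-range bytes with other bytes fall under the remaining rules, which this sketch deliberately leaves out.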

Japanese developers are aware of the problems handling the half-width SJIS kana, and so they try to avoid them by using full-width SJIS kana instead.

Unlike EUC and SJIS, JIS is a modal encoding that uses <Esc> to enter and exit from multibyte mode. Detecting JIS strings is accomplished by searching for these <Esc> characters.
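As a rough illustration of this kind of search, the sketch below scans a buffer for four widely used JIS escape sequences. The function name has_jis_escape is hypothetical, and the set of sequences checked is an assumption made for the example; the guide does not say which sequences os_str_conv actually recognizes.

```cpp
#include <cassert>
#include <cstddef>

// Returns true if the buffer contains one of four common JIS escape
// sequences: ESC $ @ or ESC $ B (enter two-byte mode), or ESC ( B or
// ESC ( J (return to one-byte mode).
bool has_jis_escape(const char* s, std::size_t n) {
    assert(s != nullptr || n == 0);
    const char ESC = 0x1B;
    for (std::size_t i = 0; i + 2 < n; ++i) {
        if (s[i] != ESC) continue;
        const char a = s[i + 1];
        const char b = s[i + 2];
        if (a == '$' && (b == '@' || b == 'B')) return true;
        if (a == '(' && (b == 'B' || b == 'J')) return true;
    }
    return false;
}
```

Because the length is passed explicitly, the same function works on buffers that are not null-terminated.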

How to Instantiate the Converter

The class os_str_conv must be instantiated once for each conversion path required for your application.

Guidelines for Extensions to os_str_conv

Users can extend this class by inheriting from it. This could be useful for developers who want to override the existing autodetector.

Additional encodings can be appended to the existing enumeration. Note that os_str_conv depends on the ordering of the existing encodings, so if you extend os_str_conv, additional encodings must appear after the ones already provided.

What Are the Different Modes and Their Meanings?

Notes on encodings

For most purposes, there is a one-to-one mapping for characters to and from each of these encodings, so no semantic information is lost during conversion. There are four exceptions to this rule:

EUC and Unicode are supersets of SJIS, so round-trip conversion EUC/Unicode<->SJIS is not possible for all EUC/Unicode Japanese characters.
There are a handful of cases of pairs of SJIS characters that map to a single character in Unicode.

The second class of exceptions is considered extremely minor in practice, and is the result of different editions (1983 and 1990) of the JIS as the basis of SJIS and Unicode.

Third, SJIS contains some special characters that are printable on Windows. Although mappings are defined for EUC, attempts to view them on X-windows, at least, fail because the fonts in use do not provide glyphs for those codes. There are no encodings for these characters in Unicode.
Lastly, JIS defines multiple ways to express a character (the base semantic unit), so a conversion from JIS to another encoding and back to JIS is not guaranteed to return an identical binary string. However, the meaning of the string (in the sense of the way it would appear if printed on a screen) is the same.

Variations Among Standard Character Mappings

The Unicode Consortium has published a general mapping from Shift-JIS to Unicode. However, actual implementations of the standard mapping differ slightly by platform and vendor. The os_str_conv class is implemented with a default mapping according to the Unicode Consortium standard, and also provides a means by which any mapping entry can be overridden at run time by a client application.

The deviations in mapping tend to be quite small. For example, here is a table that shows the incompatibility of the Unicode Consortium standard and the maps that Microsoft uses in Windows NT:

SJIS Code     Unicode Consortium Mapping     Microsoft Mapping
\   5C        00A5 YEN SIGN                  005C REVERSE SOLIDUS(*)
~   7E        203E OVERLINE                  007E TILDE
＼  81,5F     005C REVERSE SOLIDUS           FF3C FULLWIDTH REVERSE SOLIDUS
～  81,60     301C WAVE DASH                 FF5E FULLWIDTH TILDE
∥  81,61     2016 DOUBLE VERTICAL LINE      2225 PARALLEL TO
－  81,7C     2212 MINUS SIGN                FF0D FULLWIDTH HYPHEN-MINUS
￠  81,91     00A2 CENT SIGN                 FFE0 FULLWIDTH CENT SIGN
￡  81,92     00A3 POUND SIGN                FFE1 FULLWIDTH POUND SIGN
￢  81,CA     00AC NOT SIGN                  FFE2 FULLWIDTH NOT SIGN

Instructions on Overriding Particular Mappings

How to modify standard encodings

To allow an application to modify the standard encoding on the fly, there is the following interface:

class os_str_conv {
public:
      ...
      struct mapping {
            os_unsigned_int32 dest;      /* destination code */
            os_unsigned_int32 src;       /* source code      */
      };
      int change_mapping(mapping table[], size_t table_sz);
      ...
};

You can modify an existing instance of os_str_conv (whether heap- or stack-allocated) by calling os_str_conv::change_mapping(). The internal mapping tables, which are shared by all instances of os_str_conv, are never modified. Instead, the additional mapping information is stored with the instance and overrides the defaults in later conversions performed by that instance.

The override mapping information applies to whatever explicit mapping has been established for the given os_str_conv instance. Mappings cannot be overridden on an instance that uses autodetect. An attempt to do so causes change_mapping() to return -1 to indicate the error.

The change_mapping() method takes the following two parameters:

os_str_conv::mapping_table[]

This is an array of mapping code pairs that can be allocated locally, globally, or on the heap. If the array is heap-allocated, the user must delete it after calling change_mapping().

Internally, change_mapping() makes a sorted copy of mapping_table[]. The sorting provides quick lookup at run time. The internal copy is freed when the os_str_conv destructor is eventually called.

Note that the mapping pairs are unsigned 32-bit quantities. The least significant byte is on the right, so, for example, the single-byte character 0x5C is represented as 0x0000005C, and the two-byte code 0x81,0x5F is 0x0000815F.
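The packing convention can be illustrated with a small helper. This is a sketch only: pack_code is a hypothetical function, not part of os_str_conv, and std::uint32_t stands in for os_unsigned_int32.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Packs up to four bytes of a character code into one 32-bit value.
// The last byte of the code ends up in the least significant position.
std::uint32_t pack_code(const unsigned char* bytes, std::size_t n) {
    assert(n <= 4);
    assert(bytes != nullptr || n == 0);
    std::uint32_t code = 0;
    for (std::size_t i = 0; i < n; ++i) {
        code = (code << 8) | bytes[i];
    }
    return code;
}
```

For the two-byte code 0x81,0x5F the loop first stores 0x81, then shifts it left by eight bits and adds 0x5F, which yields 0x0000815F.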

size_t table_sz

This is the number of elements in the mapping_table. The user should take care that this is not the number of bytes in the array.

Example

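Here is an example that applies the Microsoft mapping between SJIS and Unicode. Because the constructor arguments are given in (dst, src) order, this instance converts Unicode to SJIS, and each table entry lists the destination code first.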

/* Each entry is {dest, src}, in the field order of os_str_conv::mapping. */
os_str_conv::mapping mapping[] = {
      {0x0000005C,0x0000005C},
      {0x0000007E,0x0000007E},
      {0x0000815F,0x0000FF3C},
      {0x00008160,0x0000FF5E},
      {0x00008161,0x00002225},
      {0x0000817C,0x0000FF0D},
      {0x00008191,0x0000FFE0},
      {0x00008192,0x0000FFE1},
      {0x000081CA,0x0000FFE2},
};
void func(char* input,char* output) {
      ...
      /* constructor arguments are (dst, src) */
      os_str_conv sjis_uni(os_str_conv::SJIS,os_str_conv::UNICODE);
      /* the second argument is the element count, not the byte count */
      sjis_uni.change_mapping(mapping,
            sizeof(mapping)/sizeof(mapping[0]));
      sjis_uni.convert(output,input);
      ...
}

In this example, mapping[] is a global, but a stack allocation would work as well.
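The override behavior described in this section (a private sorted copy of the caller's table, consulted before the shared default mapping) might be pictured roughly as follows. This is a sketch under stated assumptions, not the os_str_conv implementation: mapping_sketch and override_table are hypothetical names, std::uint32_t stands in for os_unsigned_int32, and sorting by source code with a binary search is one plausible way to get the quick lookup the text mentions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-in with the same two fields, in the same order, as os_str_conv::mapping.
struct mapping_sketch {
    std::uint32_t dest;  // destination code
    std::uint32_t src;   // source code
};

class override_table {
public:
    // Copies table_sz elements (not bytes) and sorts the copy by source code.
    // The caller's array is left untouched and can be freed afterward.
    override_table(const mapping_sketch* table, std::size_t table_sz)
        : sorted_(table, table + table_sz) {
        assert(table != nullptr || table_sz == 0);
        std::sort(sorted_.begin(), sorted_.end(),
                  [](const mapping_sketch& a, const mapping_sketch& b) {
                      return a.src < b.src;
                  });
    }

    // Returns the override for src if one exists, otherwise default_dest,
    // which represents the result of the shared default mapping.
    std::uint32_t lookup(std::uint32_t src, std::uint32_t default_dest) const {
        auto it = std::lower_bound(
            sorted_.begin(), sorted_.end(), src,
            [](const mapping_sketch& m, std::uint32_t key) {
                return m.src < key;
            });
        if (it != sorted_.end() && it->src == src) return it->dest;
        return default_dest;
    }

private:
    std::vector<mapping_sketch> sorted_;
};
```

Keeping the overrides in a per-object copy mirrors the point made earlier that the shared internal tables are never modified.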

Byte Order

Since each Unicode character is handled here as a 16-bit quantity, its byte order in memory depends on the platform architecture. On little-endian systems, such as Intel, the low-order byte comes first. On big-endian systems (SPARC, HP, and MIPS, for example) the high-order byte comes first. There are three overloads of the os_str_conv::convert() method to provide flexibility in dealing with this:

encode_type convert(char* dest, const char* src); 
encode_type convert(os_unsigned_int16* dest, const char* src); 
encode_type convert(char* dest, const os_unsigned_int16* src);

If a parameter is of char* type, all 16-bit quantities are considered big-endian, regardless of platform. However, if the type is os_unsigned_int16*, the values assigned or read are handled according to the platform architecture.
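To make the distinction concrete, the sketch below moves 16-bit values between a native array and a byte buffer that is always big-endian. It illustrates the convention only and is not library code: store_big_endian and load_big_endian are hypothetical helpers, and std::uint16_t stands in for os_unsigned_int16.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Writes n 16-bit values to dest with the high-order byte first,
// whatever the byte order of the host platform.
void store_big_endian(char* dest, const std::uint16_t* src, std::size_t n) {
    assert((dest != nullptr && src != nullptr) || n == 0);
    for (std::size_t i = 0; i < n; ++i) {
        dest[2 * i]     = static_cast<char>(src[i] >> 8);
        dest[2 * i + 1] = static_cast<char>(src[i] & 0xFF);
    }
}

// Reads n big-endian 16-bit values from src into native-order integers.
void load_big_endian(std::uint16_t* dest, const char* src, std::size_t n) {
    assert((dest != nullptr && src != nullptr) || n == 0);
    for (std::size_t i = 0; i < n; ++i) {
        const unsigned hi = static_cast<unsigned char>(src[2 * i]);
        const unsigned lo = static_cast<unsigned char>(src[2 * i + 1]);
        dest[i] = static_cast<std::uint16_t>((hi << 8) | lo);
    }
}
```

Shifting and masking, rather than copying raw memory, is what makes the byte buffer identical on little-endian and big-endian hosts.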

Updated: 03/31/98 17:04:12
