ObjectStore C++ API User Guide

Chapter 9

Using Asian Language String Encodings

There are many standards for encoding Asian characters. In Japan, for example, five encodings are in broad use: JIS, SJIS, EUC, Unicode, and UTF-8.

Usually an application uses one encoding for all strings to be stored inside a database. The encoding chosen is most often the one used in the operating system of the ObjectStore client.

However, if the application has heterogeneous clients using a variety of encodings, conversion from one encoding to another is necessary at some point. The clients could be traditional ObjectStore client processes or thin-client browsers that emit data in different encodings.

The Class Library: os_str_conv

This class library provides conversion facilities for various Japanese language text encoding methods: EUC, JIS, SJIS, Unicode, and UTF-8.

The library provides a facility to detect the encoding of a given string. This is useful for applications in which a client might send strings in an unknown format, a common problem for Internet applications.

The most common application of this class is conversion between EUC and SJIS, which allows UNIX and Windows applications to share data. JIS is commonly used for email. Applications normally store data in a homogeneous format inside a database, and incoming strings are converted as required before they are persistently allocated. Outgoing strings can also be converted to the client's native encoding. For Web applications this outgoing conversion is usually not necessary, since internationally aware browsers (Netscape 2.0 and above, for example) can automatically detect and convert various incoming formats themselves.

The class library currently consists of a single class, os_str_conv, instantiated once for each conversion path required:

os_str_conv(encode_type dst, encode_type src=AUTOMATIC);
where

enum encode_type {          /* string encode type                          */
      UNKNOWN=0,            /* convert or automatic detect fail            */
      AUTOMATIC,            /* detect automatically                        */
      AUTOMATIC_ALLOW_KANA, /* detect automatically, allow half-width-kana */
      ASCII,                /* ASCII                                       */
      SJIS,                 /* Shift-JIS                                   */
      EUC,                  /* EUC                                         */
      UNICODE,              /* Unicode (can't automatic detect)            */
      JIS,                  /* JIS                                         */
      UTF8                  /* UTF-8 (can't automatic detect)              */
      /* add new encode type here ! */
};
Here is an example. Given an instance of os_str_conv, such as

os_str_conv *sjis_to_euc = new os_str_conv(os_str_conv::EUC, 
os_str_conv::SJIS); 
A conversion can be done on char* sjis_src:

char *euc_dest = new char[sjis_to_euc->get_converted_size(
      sjis_src)]; 
sjis_to_euc->convert(euc_dest, sjis_src); 
The call to get_converted_size() is not strictly required; it is provided for the convenience of the user to allocate buffers of appropriate size. Because it requires examination of the entire source string, time to complete it is proportional to the source string length.

Automatic Detection of a Source String Encoding

Sometimes, it is not possible for an application to know the encoding of a given source string. os_str_conv provides methods that can analyze a given string and determine its encoding. For example:

// dst is EUC; src is detected automatically
os_str_conv *to_euc = new os_str_conv(os_str_conv::EUC,
      os_str_conv::AUTOMATIC);
// return type of get_converted_size() is assumed here to fit in size_t
size_t len = to_euc->get_converted_size(unknown_src);
if (len)
{
      char *euc_dest = new char[len];
      to_euc->convert(euc_dest, unknown_src);
}
else
{
      // couldn't convert -- application needs to handle this!
}
Important Note: The autodetector is not guaranteed to work in all cases.

If it fails inside get_converted_size, get_converted_size returns 0 to indicate the failure. Be careful not to allocate strings based on its return value without checking for failure!

Unfortunately, no automatic detection algorithm can correctly distinguish EUC from SJIS in all cases because of overlap in their assignment ranges. Clever algorithms exploit patterns typical of real text. This implementation is reasonably straightforward. The most difficult problem (distinguishing between SJIS half-width kana and EUC) is avoided by asking the user to choose between the two possible interpretations. In nearly all cases, os_str_conv::AUTOMATIC is the appropriate setting.

In practice, the problems of ambiguity are not likely to affect applications, since incoming text is usually all in the single encoding defined by the operating system on which it was generated. Autodetect can therefore be used once, at the beginning of a session, on the reasonable assumption that the encoding will not change afterward.

As mentioned earlier, there are areas in the EUC and SJIS encodings that overlap, and so a given string might be valid in either encoding. This makes autodetection ambiguous.

There are two ambiguous cases:

Detection is handled according to these rules:

  1. The algorithm examines each character of the string in sequence until the encoding is determined. Therefore, a string beginning with an unambiguous substring followed by an ambiguous substring is detected according to the first substring.

  2. Strings consisting entirely of the second ambiguous type are handled as unknown. As mentioned, this case is very unlikely.

  3. All SJIS half-width kana are single-byte encodings. Therefore, a string consisting entirely of an odd number of bytes in the SJIS half-width kana range is considered SJIS.

  4. A string beginning with an even number of bytes in the SJIS half-width kana range is ambiguous until the following characters are examined according to normal detection rules.

  5. A string beginning with an odd number of bytes in the SJIS half-width kana range requires special examination of the last character. If this is an EUC first-byte code, and it is followed by a valid EUC second-byte code, then the string is EUC. However, if the following code is not a valid EUC second-byte (it might be ordinary ASCII), then the final character is interpreted as SJIS half-width kana and the string is interpreted as SJIS.

  6. A string consisting entirely of an even number of bytes in the SJIS half-width kana range is ambiguous. It is quite possible for such a string to appear in real applications. The os_str_conv::AUTOMATIC setting causes the autodetector to interpret this case as SJIS. However, if os_str_conv::AUTOMATIC_ALLOW_KANA is specified, this case is interpreted as unknown. Object Design believes that the SJIS interpretation is correct for most cases.
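
Rules 3 and 6 can be illustrated with a small stand-alone sketch. It is not the os_str_conv autodetector, only an approximation for strings made up entirely of bytes in the SJIS half-width kana range, which is assumed here to be 0xA1 through 0xDF. The names Guess, in_sjis_kana_range, and classify_kana_only are hypothetical and do not appear in the library.

```cpp
#include <string>

// Hypothetical result type for this sketch (not the library's encode_type).
enum class Guess { Unknown, Sjis, NotKanaOnly };

// True if the byte lies in the SJIS half-width kana range (assumed 0xA1..0xDF).
inline bool in_sjis_kana_range(unsigned char b) {
    return b >= 0xA1 && b <= 0xDF;
}

// Classifies a string that consists only of kana-range bytes.
// allow_kana plays the role of the AUTOMATIC_ALLOW_KANA setting.
Guess classify_kana_only(const std::string& s, bool allow_kana) {
    if (s.empty()) return Guess::NotKanaOnly;
    for (unsigned char b : s) {
        if (!in_sjis_kana_range(b)) return Guess::NotKanaOnly;
    }
    // Rule 3: an odd byte count cannot be a run of two-byte EUC characters.
    if (s.size() % 2 == 1) return Guess::Sjis;
    // Rule 6: an even byte count is ambiguous; the setting decides.
    return allow_kana ? Guess::Unknown : Guess::Sjis;
}
```

Calling classify_kana_only with a three-byte kana-range string yields Guess::Sjis regardless of the flag, while a two-byte string yields Guess::Sjis or Guess::Unknown depending on allow_kana.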

Japanese developers are aware of the problems handling the half-width SJIS kana, and so they try to avoid them by using full-width SJIS kana instead.

Unlike EUC and SJIS, JIS is a modal encoding that uses <Esc> to enter and exit from multibyte mode. Detecting JIS strings is accomplished by searching for these <Esc> characters.
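
As a rough illustration of escape-based detection, the following sketch scans a buffer for the escape sequences that switch into a two-byte JIS character set (ESC $ @ and ESC $ B, as used in ISO-2022-JP). The function name contains_jis_shift_in is hypothetical, and the sketch makes no claim about how os_str_conv performs the search internally.

```cpp
#include <cstddef>
#include <string>

// Returns true if the buffer contains ESC $ @ or ESC $ B, the sequences
// that enter two-byte mode in JIS-encoded text.
bool contains_jis_shift_in(const std::string& s) {
    const char ESC = '\x1b';
    // i + 2 < size guarantees that s[i + 1] and s[i + 2] are valid.
    for (std::size_t i = 0; i + 2 < s.size(); ++i) {
        if (s[i] == ESC && s[i + 1] == '$' &&
            (s[i + 2] == '@' || s[i + 2] == 'B')) {
            return true;
        }
    }
    return false;
}
```

A string with no escape characters, or with a truncated escape sequence at the end, is reported as not containing JIS two-byte text.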

How to Instantiate the Converter

The class os_str_conv must be instantiated once for each conversion path required for your application.

Guidelines for Extensions to os_str_conv

Users can extend this class by inheriting from it. This could be useful for developers who want to override the existing autodetector.

Additional encodings can be appended to the existing enumeration. Note that os_str_conv depends on the ordering of the existing encodings, so if you extend os_str_conv, additional encodings must appear after the ones already provided.

What Are the Different Modes and Their Meanings?

Notes on encodings
For most purposes, there is a one-to-one mapping for characters to and from each of these encodings, so no semantic information is lost during conversion. There are four exceptions to this rule:

The second class of exceptions is considered extremely minor in practice, and is the result of different editions (1983 and 1990) of the JIS as the basis of SJIS and Unicode.

Variations Among Standard Character Mappings

The Unicode Consortium has published a general mapping from Shift-JIS to Unicode. However, actual implementations of the standard mapping differ slightly by platform and vendor. The os_str_conv class is implemented with a default mapping according to the Unicode Consortium standard, and also provides a means by which any mapping entry can be overridden at run time by a client application.

The deviations in mapping tend to be quite small. For example, here is a table that shows the incompatibility of the Unicode Consortium standard and the maps that Microsoft uses in Windows NT:

Char  SJIS Code  Unicode Consortium Mapping   Microsoft Mapping
\     5C         00A5 YEN SIGN                005C REVERSE SOLIDUS(*)
~     7E         203E OVERLINE                007E TILDE
＼    81,5F      005C REVERSE SOLIDUS         FF3C FULLWIDTH REVERSE SOLIDUS
〜    81,60      301C WAVE DASH               FF5E FULLWIDTH TILDE
‖     81,61      2016 DOUBLE VERTICAL LINE    2225 PARALLEL TO
−     81,7C      2212 MINUS SIGN              FF0D FULLWIDTH HYPHEN-MINUS
¢     81,91      00A2 CENT SIGN               FFE0 FULLWIDTH CENT SIGN
£     81,92      00A3 POUND SIGN              FFE1 FULLWIDTH POUND SIGN
¬     81,CA      00AC NOT SIGN                FFE2 FULLWIDTH NOT SIGN

Instructions on Overriding Particular Mappings

How to modify standard encodings
To allow an application to modify the standard encoding on the fly, there is the following interface:

class os_str_conv {
public:
      ...
      struct mapping {
            os_unsigned_int32 dest;      /* destination code */
            os_unsigned_int32 src;       /* source code      */
      };
      int change_mapping(mapping table[],size_t table_sz);
      ...
};
You can modify an existing instance of os_str_conv (whether heap- or stack-allocated) by calling os_str_conv::change_mapping(). The internal mapping tables, which are shared by all instances of os_str_conv, are never modified. Instead, the additional mapping information is stored with the instance and overrides the defaults in subsequent conversions performed by that instance.

The override mapping information applies to whatever explicit mapping has been established for the given os_str_conv instance. Mappings cannot be overridden for os_str_conv instances that use autodetect. An attempt to do so causes change_mapping() to return -1 to indicate this error condition.

The change_mapping() method takes the following two parameters:

table[]: This is the array of mapping code pairs, referred to here as mapping_table[]. It can be allocated locally, globally, or on the heap. If the array is heap-allocated, the user must delete it after calling change_mapping().

Internally, change_mapping() makes a sorted copy of mapping_table[]. The sorting provides quick lookup at run time. The internal copy is freed when the os_str_conv destructor is eventually called.

Note that the mapping pairs are unsigned 32-bit quantities. The LSB is on the right, so, for example, the single-byte character 0x5C is represented as 0x0000005C, and the two-byte code 0x81,0x5F is 0x0000815F.

table_sz: This is the number of elements in mapping_table[]. Take care to pass the element count, not the number of bytes in the array.
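
The two points above (how a multibyte code is packed into a 32-bit value, and element count versus byte count) can be sketched as follows. The helpers pack_code and element_count are hypothetical names introduced only for illustration. They are not part of os_str_conv, and std::uint32_t stands in for os_unsigned_int32.

```cpp
#include <cstddef>
#include <cstdint>

// Packs the bytes of one character code into a 32-bit value, with the first
// byte in the more significant position and the last byte in the LSB.
std::uint32_t pack_code(const unsigned char* bytes, std::size_t n) {
    std::uint32_t code = 0;
    for (std::size_t i = 0; i < n && i < 4; ++i) {
        code = (code << 8) | bytes[i];
    }
    return code;
}

// Number of elements in a built-in array, as opposed to sizeof(array),
// which is its size in bytes.
template <typename T, std::size_t N>
constexpr std::size_t element_count(const T (&)[N]) {
    return N;
}
```

With these helpers, the single byte 0x5C packs to 0x0000005C and the byte pair 0x81,0x5F packs to 0x0000815F. element_count computes the same value as the sizeof(array)/sizeof(array[0]) idiom.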

Example

Here is an example that applies the Microsoft mapping when converting between Unicode and SJIS.

os_str_conv::mapping mapping[] = {
      /* {dest, src} pairs, in the field order of os_str_conv::mapping */
      {0x0000005C,0x0000005C},
      {0x0000007E,0x0000007E},
      {0x0000815F,0x0000FF3C},
      {0x00008160,0x0000FF5E},
      {0x00008161,0x00002225},
      {0x0000817C,0x0000FF0D},
      {0x00008191,0x0000FFE0},
      {0x00008192,0x0000FFE1},
      {0x000081CA,0x0000FFE2},
};
void func(char* input,char* output) {
      ...
      /* Constructor arguments are (dst, src), so as written this
         instance converts from Unicode to SJIS. */
      os_str_conv sjis_uni(os_str_conv::SJIS,os_str_conv::UNICODE);
      sjis_uni.change_mapping(mapping,
            sizeof(mapping)/sizeof(mapping[0]));
      sjis_uni.convert(output,input);
      ...
}
In this example, mapping[] is a global, but a stack allocation would work as well.
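
To make the idea of a sorted internal copy concrete, here is a minimal sketch of an override table that is copied, sorted by source code, and consulted by binary search before falling back to a default. It is an assumption about one reasonable approach, not the actual os_str_conv implementation. Mapping, make_sorted_copy, and apply_override are hypothetical names, and std::uint32_t stands in for os_unsigned_int32.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for os_str_conv::mapping, with the same field order.
struct Mapping {
    std::uint32_t dest;  // destination code
    std::uint32_t src;   // source code
};

// Returns a copy of the caller's table, sorted by source code. The caller's
// array is left untouched and can be freed afterward.
std::vector<Mapping> make_sorted_copy(const Mapping table[], std::size_t table_sz) {
    std::vector<Mapping> copy(table, table + table_sz);
    std::sort(copy.begin(), copy.end(),
              [](const Mapping& a, const Mapping& b) { return a.src < b.src; });
    return copy;
}

// Binary-searches the sorted overrides for src. Returns the override if one
// exists, otherwise the fallback (for example, the default table's result).
std::uint32_t apply_override(const std::vector<Mapping>& sorted,
                             std::uint32_t src, std::uint32_t fallback) {
    auto it = std::lower_bound(
        sorted.begin(), sorted.end(), src,
        [](const Mapping& m, std::uint32_t key) { return m.src < key; });
    return (it != sorted.end() && it->src == src) ? it->dest : fallback;
}
```

Sorting once at registration time keeps each later lookup logarithmic in the number of overrides, which matters little for nine entries but scales to larger tables.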

Byte Order

Since each Unicode character is handled as a 16-bit quantity, byte order depends on platform architecture. On little-endian systems, such as Intel, the low-order byte comes first. On big-endian systems (SPARC, HP, and MIPS, for example) the high-order byte comes first. There are three overloads of the os_str_conv::convert() method to provide flexibility in dealing with this:

encode_type convert(char* dest, const char* src); 
encode_type convert(os_unsigned_int16* dest, const char* src); 
encode_type convert(char* dest, const os_unsigned_int16* src);
If a parameter is of char* type, all 16-bit quantities are considered big-endian, regardless of platform. However, if the type is os_unsigned_int16*, the values assigned or read are handled according to the platform architecture.
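
The char* convention can be pictured with a short sketch that stores and reads a 16-bit code as two bytes with the high-order byte first, independent of the host byte order. The helpers put_be16 and get_be16 are hypothetical and are shown only to illustrate the layout. They are not library functions.

```cpp
#include <cstdint>

// Stores a 16-bit code into a byte buffer, high-order byte first.
void put_be16(char* dest, std::uint16_t code) {
    dest[0] = static_cast<char>(code >> 8);
    dest[1] = static_cast<char>(code & 0xFF);
}

// Reads a 16-bit code from a byte buffer that holds the high-order byte first.
std::uint16_t get_be16(const char* src) {
    return static_cast<std::uint16_t>(
        (static_cast<unsigned char>(src[0]) << 8) |
         static_cast<unsigned char>(src[1]));
}
```

Because the shifts operate on values rather than on memory, the same code produces the same byte sequence on little-endian and big-endian machines.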

Overhead

Using overrides to the string conversion function incurs the following overhead:

Restrictions

Not all conversion combinations are possible. For example, it is impossible to convert arbitrary Unicode to ASCII. This implementation guards against nonsensical requests, but developers who extend it should take care with such cases. Of course, Japanese to ASCII conversion is only possible on the ASCII subset of characters in the Japanese encodings. Attempts to convert Japanese strings to ASCII result in the return of an error condition.

Autodetect only detects SJIS, JIS, and EUC. Do not feed the autodetector Unicode or UTF-8 strings.

The EUC<->Unicode converter only works for characters in the SJIS set. While this might sound perverse, it is reasonable for actual applications, since characters outside the SJIS set are extremely rare.

Users should be aware that the 0 to 127 range of single-byte SJIS characters is not ASCII, even though the characters look like ASCII. This range is known as JIS-Roman. Specifically, the characters {'\', '~' , '|'} have different meanings. The practical significance is that the map of characters [0 to 127] from ASCII->Unicode->SJIS is not an identity.
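
The non-identity round trip can be seen with the backslash. The sketch below hard-codes only a few code points taken from the Unicode Consortium column of the mapping table earlier in this chapter. Both function names are hypothetical, letters and digits are assumed to map to themselves, and every other code point is reported as 0 (not covered by the sketch).

```cpp
#include <cstdint>

// ASCII to Unicode is the identity on 0..127.
std::uint32_t ascii_to_unicode(unsigned char c) {
    return c;
}

// Unicode to SJIS for a handful of code points only. Returns 0 for anything
// this sketch does not cover.
std::uint32_t unicode_to_sjis_sketch(std::uint32_t u) {
    switch (u) {
        case 0x00A5: return 0x5C;    // YEN SIGN occupies single-byte 0x5C
        case 0x203E: return 0x7E;    // OVERLINE occupies single-byte 0x7E
        case 0x005C: return 0x815F;  // REVERSE SOLIDUS becomes a two-byte code
        default: break;
    }
    // Assumption: digits and Latin letters keep their 7-bit codes.
    if ((u >= 0x30 && u <= 0x39) || (u >= 0x41 && u <= 0x5A) ||
        (u >= 0x61 && u <= 0x7A)) {
        return u;
    }
    return 0;
}
```

Passing the ASCII backslash (0x5C) through both functions yields the two-byte code 0x815F rather than 0x5C, while a letter such as 'A' comes back unchanged.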

Performance Notes

EUC and SJIS are very closely related since they both are based on the JIS ordering. Therefore, conversion between these requires no table lookup.
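
The arithmetic relationship can be sketched as follows. This is the widely used formula for two-byte JIS X 0208 characters, written here as an independent illustration rather than as the os_str_conv code. It does not handle half-width kana or three-byte EUC sequences, and BytePair, euc_to_sjis, and sjis_to_euc are hypothetical names.

```cpp
#include <cstdint>

struct BytePair {
    std::uint8_t first;
    std::uint8_t second;
};

// Converts one two-byte EUC character (both bytes in 0xA1..0xFE) to SJIS.
BytePair euc_to_sjis(std::uint8_t e1, std::uint8_t e2) {
    const int j1 = e1 & 0x7F;  // clearing the high bit gives the JIS bytes
    const int j2 = e2 & 0x7F;
    int s1 = ((j1 - 0x21) >> 1) + 0x81;  // two JIS rows share one SJIS lead byte
    if (s1 > 0x9F) s1 += 0x40;           // skip the single-byte kana range
    int s2;
    if (j1 & 1) {                        // odd row: lower half of the trail range
        s2 = j2 + 0x1F;
        if (s2 >= 0x7F) ++s2;            // 0x7F is never a trail byte
    } else {                             // even row: upper half
        s2 = j2 + 0x7E;
    }
    return {static_cast<std::uint8_t>(s1), static_cast<std::uint8_t>(s2)};
}

// Converts one two-byte SJIS character back to EUC.
BytePair sjis_to_euc(std::uint8_t s1, std::uint8_t s2) {
    int j1 = ((s1 - (s1 >= 0xE0 ? 0xC1 : 0x81)) << 1) + 0x21;
    int j2;
    if (s2 >= 0x9F) {                    // upper half means the even JIS row
        ++j1;
        j2 = s2 - 0x7E;
    } else {
        j2 = s2 - (s2 >= 0x80 ? 0x20 : 0x1F);
    }
    return {static_cast<std::uint8_t>(j1 | 0x80),
            static_cast<std::uint8_t>(j2 | 0x80)};
}
```

For instance, EUC 0xA1,0xC0 converts to SJIS 0x81,0x5F, the first two-byte code listed in the mapping table in this chapter.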

JIS conversion requires simple parsing for <Esc> characters. Once the escape sequences have been stripped, the multibyte sequences can be converted to EUC by setting the highest bit of each byte.
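
A minimal sketch of this parse follows, assuming the four escape sequences commonly found in JIS-encoded text: ESC $ @ and ESC $ B to enter two-byte mode, and ESC ( B and ESC ( J to leave it. The function name jis_to_euc is hypothetical, and the sketch is not the library's implementation.

```cpp
#include <cstddef>
#include <string>

// Strips JIS escape sequences and sets the high bit on every byte that was
// inside a two-byte section, producing EUC.
std::string jis_to_euc(const std::string& jis) {
    std::string out;
    bool two_byte = false;
    std::size_t i = 0;
    while (i < jis.size()) {
        if (jis[i] == '\x1b' && i + 2 < jis.size()) {
            const char a = jis[i + 1];
            const char b = jis[i + 2];
            if (a == '$' && (b == '@' || b == 'B')) {  // enter two-byte mode
                two_byte = true;
                i += 3;
                continue;
            }
            if (a == '(' && (b == 'B' || b == 'J')) {  // back to single-byte mode
                two_byte = false;
                i += 3;
                continue;
            }
        }
        // Inside a two-byte section, set the high bit; otherwise copy as-is.
        out += two_byte ? static_cast<char>(jis[i] | 0x80) : jis[i];
        ++i;
    }
    return out;
}
```

Single-byte text outside the escape-delimited sections is copied through unchanged, so mixed ASCII and Japanese input keeps its ASCII portions intact.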




Copyright © 1997 Object Design, Inc. All rights reserved.

Updated: 03/31/98 17:04:12