There are many standards for encoding Asian characters. In Japan, for example, five encodings are in broad use: JIS, SJIS, EUC, Unicode, and UTF-8.
The Class Library: os_str_conv
This class library provides conversion facilities for various Japanese language text encoding methods: EUC, JIS, SJIS, Unicode, and UTF8.
    os_str_conv(encode_type dst, encode_type src=AUTOMATIC);

Where

    enum encode_type {
        /* string encode type ------------------- */
        UNKNOWN=0,            /* convert or automatic detect fail */
        AUTOMATIC,            /* detect automatically */
        AUTOMATIC_ALLOW_KANA, /* detect automatically, allow half-width-kana */
        ASCII,                /* ASCII */
        SJIS,                 /* Shift-JIS */
        EUC,                  /* EUC */
        UNICODE,              /* Unicode (can't automatic detect) */
        JIS,                  /* JIS */
        UTF8                  /* UTF-8 (can't automatic detect) */
        /* add new encode type here ! */
    };

Here is an example. Given an instance of os_str_conv, such as
    os_str_conv *sjis_to_euc = new os_str_conv(os_str_conv::EUC, os_str_conv::SJIS);

A conversion can be done on char* sjis_src:

    char *euc_dest = new char[sjis_to_euc->get_converted_size(sjis_src)];
    sjis_to_euc->convert(euc_dest, sjis_src);

The call to get_converted_size() is not strictly required; it is provided as a convenience for allocating buffers of the appropriate size. Because it must examine the entire source string, its running time is proportional to the length of the source string.
When the encoding of the source string is not known, the converter can be asked to detect it:

    /* Constructor arguments are (dst, src); the source encoding is autodetected. */
    os_str_conv *to_euc = new os_str_conv(os_str_conv::EUC, os_str_conv::AUTOMATIC);
    size_t len = to_euc->get_converted_size(unknown_src);
    if (len) {
        char *euc_dest = new char[len];
        to_euc->convert(euc_dest, unknown_src);
    } else {
        // couldn't convert -- application needs to handle this!
    }

Important Note: The autodetector is not guaranteed to work in all cases.
If it fails inside get_converted_size, get_converted_size returns 0 to indicate the failure. Be careful not to allocate strings based on its return value without checking for failure!
Unfortunately, no automatic detection algorithm can correctly distinguish EUC from SJIS in all cases, because their assignment ranges overlap. More elaborate algorithms exploit patterns typical of real text; this implementation is reasonably straightforward by comparison. The most difficult problem (distinguishing between SJIS half-width kana and EUC) is avoided by asking the user to choose between the two possible interpretations, os_str_conv::AUTOMATIC and os_str_conv::AUTOMATIC_ALLOW_KANA. In nearly all cases, os_str_conv::AUTOMATIC is the appropriate setting.
In practice, ambiguity is unlikely to affect applications, because incoming text is usually all in the single encoding defined by the operating system on which it was generated. Autodetection can therefore be run once, at the beginning of a session, and it is reasonable to assume that the encoding will not change afterward.
As mentioned earlier, there are areas in the EUC and SJIS encodings that overlap, and so a given string might be valid in either encoding. This makes autodetection ambiguous.
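The overlap can be made concrete with a pair of structural validity checks. This is a minimal sketch, not the autodetector used by os_str_conv: the function names is_valid_sjis and is_valid_euc are hypothetical, the byte ranges are the commonly documented ones for Shift-JIS and EUC-JP, and the three-byte EUC form introduced by 0x8F is deliberately left out.

```cpp
#include <cstddef>

// Hypothetical helpers, not part of os_str_conv. They check byte structure only.

// True if buf[0..len) parses as Shift-JIS: 7-bit bytes, single-byte
// half-width kana (0xA1-0xDF), or a lead byte followed by a trail byte.
bool is_valid_sjis(const unsigned char* buf, std::size_t len) {
    std::size_t i = 0;
    while (i < len) {
        const unsigned char b = buf[i];
        if (b <= 0x7F || (b >= 0xA1 && b <= 0xDF)) {
            i += 1;  // single-byte character
        } else if ((b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xEF)) {
            if (i + 1 >= len) return false;  // truncated two-byte character
            const unsigned char t = buf[i + 1];
            const bool trail_ok = (t >= 0x40 && t <= 0x7E) || (t >= 0x80 && t <= 0xFC);
            if (!trail_ok) return false;
            i += 2;
        } else {
            return false;  // byte cannot start a character
        }
    }
    return true;
}

// True if buf[0..len) parses as EUC-JP: 7-bit bytes, 0x8E followed by a
// half-width kana byte, or two bytes that are both in 0xA1-0xFE.
bool is_valid_euc(const unsigned char* buf, std::size_t len) {
    std::size_t i = 0;
    while (i < len) {
        const unsigned char b = buf[i];
        if (b <= 0x7F) {
            i += 1;
        } else if (b == 0x8E) {
            if (i + 1 >= len) return false;
            const unsigned char k = buf[i + 1];
            if (k < 0xA1 || k > 0xDF) return false;
            i += 2;
        } else if (b >= 0xA1 && b <= 0xFE) {
            if (i + 1 >= len) return false;
            const unsigned char t = buf[i + 1];
            if (t < 0xA1 || t > 0xFE) return false;
            i += 2;
        } else {
            return false;
        }
    }
    return true;
}
```

Under these assumptions the two bytes 0xC5 0xC5 pass both checks: they read as two half-width kana in SJIS and as one two-byte character in EUC, so byte structure alone cannot settle which encoding was intended.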
There are two ambiguous cases:
Unlike EUC and SJIS, JIS is a modal encoding that uses <Esc> to enter and exit from multibyte mode. Detecting JIS strings is accomplished by searching for these <Esc> characters.
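A search of this kind might look like the following sketch. The function name looks_like_jis is hypothetical, and the escape sequences it recognizes (ESC $ @ and ESC $ B to enter two-byte mode, ESC ( B and ESC ( J to return to single-byte mode) are the ones commonly used in ISO-2022-JP text; the sequences that os_str_conv itself looks for are not stated here.

```cpp
#include <cstddef>

// Hypothetical helper, not part of os_str_conv.
// Returns true if buf contains one of the common JIS escape sequences.
bool looks_like_jis(const char* buf, std::size_t len) {
    for (std::size_t i = 0; i + 2 < len; ++i) {
        if (buf[i] != 0x1B) continue;  // not <Esc>
        const char a = buf[i + 1];
        const char b = buf[i + 2];
        if (a == '$' && (b == '@' || b == 'B')) return true;  // enter two-byte mode
        if (a == '(' && (b == 'B' || b == 'J')) return true;  // back to single-byte mode
    }
    return false;
}
```

An <Esc> that is followed by anything else (for example a terminal control sequence) is ignored, so plain text containing stray escape bytes is not reported as JIS.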
Additional encodings can be appended to the existing enumeration. Note that os_str_conv depends on the ordering of the existing encodings, so if you extend os_str_conv, additional encodings must appear after the ones already provided.
The deviations in mapping tend to be quite small. For example, here is a table that shows the incompatibility of the Unicode Consortium standard and the maps that Microsoft uses in Windows NT:
Instructions on Overriding Particular Mappings
How to modify standard encodings
To allow an application to modify the standard encoding on the fly, there is the following interface:
    class os_str_conv {
    public:
        ...
        struct mapping {
            os_unsigned_int32 dest; /* destination code */
            os_unsigned_int32 src;  /* source code */
        };
        int change_mapping(mapping table[], size_t table_sz);
        ...
    };

You can modify an existing instance of os_str_conv (whether heap- or stack-allocated) by calling os_str_conv::change_mapping(). The internal mapping tables, which are shared by all instances of os_str_conv, are never modified. Instead, the additional mapping information is stored with the instance and overrides the standard mapping in later conversions performed by that instance.
The override mapping information applies to whatever explicit mapping has been established for the given os_str_conv instance. Mappings cannot be overridden on instances that use autodetect; an attempt to do so makes change_mapping() return -1 to indicate the error.
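The change_mapping() method takes the following two parameters:
table: an array of mapping pairs, each giving a source code (src) and the destination code (dest) to use for it.
table_sz: the number of elements in the array.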
Internally, change_mapping() makes a sorted copy of mapping_table[]. The sorting provides quick lookup at run time. The internal copy is freed when the os_str_conv destructor is eventually called.
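The sorted-copy idea can be sketched as follows. This is an illustration under assumptions, not the os_str_conv implementation: the names mapping_pair and override_table are hypothetical, and sorting by source code with a binary search is one plausible way to get the quick lookup described above.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for os_str_conv::mapping, with the same field order.
struct mapping_pair {
    std::uint32_t dest;  // destination code
    std::uint32_t src;   // source code
};

// Hypothetical class: keeps a private copy of the table, sorted by source code.
class override_table {
public:
    override_table(const mapping_pair* table, std::size_t table_sz)
        : sorted_(table, table + table_sz) {
        std::sort(sorted_.begin(), sorted_.end(),
                  [](const mapping_pair& a, const mapping_pair& b) {
                      return a.src < b.src;
                  });
    }

    // Binary search for src. On a hit, stores the override in dest and returns true.
    bool lookup(std::uint32_t src, std::uint32_t& dest) const {
        auto it = std::lower_bound(sorted_.begin(), sorted_.end(), src,
                                   [](const mapping_pair& m, std::uint32_t key) {
                                       return m.src < key;
                                   });
        if (it == sorted_.end() || it->src != src) return false;
        dest = it->dest;
        return true;
    }

private:
    std::vector<mapping_pair> sorted_;  // released when the object is destroyed
};
```

Because the object owns its copy, the caller's array can be a global or a local that goes out of scope, and the copy is released with the object, which mirrors the lifetime described above.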
Note that the mapping pairs are unsigned 32-bit quantities. The least significant byte is on the right, so, for example, the single-byte character 0x5C is represented as 0x0000005C, and the two-byte code 0x81,0x5F is 0x0000815F.
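The packing rule can be written out as a small helper. The name pack_code is hypothetical and is not part of os_str_conv; the sketch only shows how the bytes of a character, in the order they appear in the string, end up in a 32-bit value with the last byte in the least significant position.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper, not part of os_str_conv.
// Packs up to four bytes of one character into a 32-bit code, first byte
// most significant, so {0x81, 0x5F} becomes 0x0000815F.
std::uint32_t pack_code(const unsigned char* bytes, std::size_t n) {
    std::uint32_t code = 0;
    for (std::size_t i = 0; i < n && i < 4; ++i) {
        code = (code << 8) | bytes[i];
    }
    return code;
}
```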
Example
Here is an example of a Microsoft SJIS->Unicode mapping.
    /* Each entry is { dest, src }, matching the field order of os_str_conv::mapping. */
    os_str_conv::mapping mapping[] = {
        {0x0000005C,0x0000005C},
        {0x0000007E,0x0000007E},
        {0x0000815F,0x0000FF3C},
        {0x00008160,0x0000FF5E},
        {0x00008161,0x00002225},
        {0x0000817C,0x0000FF0D},
        {0x00008191,0x0000FFE0},
        {0x00008192,0x0000FFE1},
        {0x000081CA,0x0000FFE2},
    };

    void func(char* input, char* output)
    {
        ...
        /* Constructor arguments are (dst, src). */
        os_str_conv sjis_uni(os_str_conv::SJIS, os_str_conv::UNICODE);
        sjis_uni.change_mapping(mapping, sizeof(mapping)/sizeof(mapping[0]));
        sjis_uni.convert(output, input);
        ...
    }

In this example, mapping[] is a global, but a stack allocation would work as well, because change_mapping() keeps its own copy of the table.
    encode_type convert(char* dest, const char* src);
    encode_type convert(os_unsigned_int16* dest, const char* src);
    encode_type convert(char* dest, const os_unsigned_int16* src);

If a parameter is of type char*, all 16-bit quantities in it are treated as big-endian, regardless of platform. If the type is os_unsigned_int16*, however, the values are written or read in the byte order of the platform architecture.
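The big-endian convention for char* buffers can be illustrated with two small helpers. The names put_be16 and get_be16 are hypothetical and are not part of os_str_conv; they only show the byte layout a caller should expect when 16-bit Unicode values travel through a char* buffer.

```cpp
#include <cstdint>

// Hypothetical helpers, not part of os_str_conv.

// Stores a 16-bit value in two bytes, most significant byte first.
void put_be16(char* dest, std::uint16_t value) {
    dest[0] = static_cast<char>(value >> 8);
    dest[1] = static_cast<char>(value & 0xFF);
}

// Reads a 16-bit value from two bytes, most significant byte first.
std::uint16_t get_be16(const char* src) {
    return static_cast<std::uint16_t>(
        (static_cast<unsigned char>(src[0]) << 8) |
         static_cast<unsigned char>(src[1]));
}
```

With these helpers the code 0xFF3C is stored as the bytes 0xFF, 0x3C on every platform, whereas a value read through an os_unsigned_int16* would follow the host byte order.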
Autodetect only detects SJIS, JIS, and EUC. Do not feed the autodetector Unicode or UTF-8 strings.
The EUC<->Unicode converter only works for characters in the SJIS set. While this might sound perverse, it is reasonable for actual applications, since characters outside the SJIS set are extremely rare.
Users should be aware that the 0 to 127 range of single-byte SJIS characters is not ASCII, even though the characters look like ASCII. This range is known as JIS-Roman. Specifically, the characters {'\', '~' , '|'} have different meanings. The practical significance is that the map of characters [0 to 127] from ASCII->Unicode->SJIS is not an identity.
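A minimal sketch of the difference, assuming the JIS X 0201 Roman assignments in which 0x5C is the yen sign and 0x7E is the overline. The function name jis_roman_to_unicode is hypothetical, and only these two positions are modelled; the '|' case mentioned above is not covered.

```cpp
#include <cstdint>

// Hypothetical helper, not part of os_str_conv.
// Maps a JIS-Roman byte (0x00-0x7F) to a Unicode code point.
std::uint32_t jis_roman_to_unicode(unsigned char b) {
    switch (b) {
        case 0x5C: return 0x00A5;  // YEN SIGN, where ASCII has the backslash
        case 0x7E: return 0x203E;  // OVERLINE, where ASCII has the tilde
        default:   return b;       // remaining positions coincide with ASCII
    }
}
```

Under this mapping an ASCII backslash (U+005C) has no single-byte JIS-Roman counterpart, which is why a round trip through Unicode cannot be an identity on the whole 0 to 127 range.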
JIS conversion requires simple parsing for <Esc> characters. Once the <Esc> sequences have been stripped, the multibyte sequences can be converted to EUC by setting the highest bit of each byte.
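The procedure can be sketched as follows. This is an illustration under assumptions rather than the os_str_conv code: the name jis_to_euc is hypothetical, only the common escape sequences ESC $ @, ESC $ B, ESC ( B and ESC ( J are recognized, and every byte seen while in two-byte mode is assumed to belong to a two-byte character.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not part of os_str_conv.
// Strips JIS escape sequences and sets the high bit on bytes in two-byte mode.
std::string jis_to_euc(const std::string& jis) {
    std::string euc;
    bool two_byte = false;  // true while inside a two-byte run
    std::size_t i = 0;
    while (i < jis.size()) {
        if (jis[i] == 0x1B && i + 2 < jis.size()) {
            const char a = jis[i + 1];
            const char b = jis[i + 2];
            if (a == '$' && (b == '@' || b == 'B')) {
                two_byte = true;   // enter two-byte mode, drop the sequence
                i += 3;
                continue;
            }
            if (a == '(' && (b == 'B' || b == 'J')) {
                two_byte = false;  // leave two-byte mode, drop the sequence
                i += 3;
                continue;
            }
        }
        const unsigned char c = static_cast<unsigned char>(jis[i]);
        euc += static_cast<char>(two_byte ? (c | 0x80) : c);
        ++i;
    }
    return euc;
}
```

For example, the JIS bytes 0x46 0x7C between ESC $ B and ESC ( B come out as the EUC bytes 0xC6 0xFC, while text outside the escape-delimited run is copied unchanged.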
Updated: 03/31/98 17:04:12