Unicode is a standard for character representation designed to accommodate every character in every language that is likely to be used in any computer application. Representation includes alphabetic, ideographic, and symbolic characters. Developed by companies that collectively constitute the Unicode Consortium, the standard uses a numbering system similar to that of ASCII, but with some fundamental differences. Most importantly, Unicode uses 16 bits for each character (UCS-2 encoding). This feature has several positive results:
Up to 65,536 character codes are available, enough for every character of nearly every language in use today
Unicode eliminates the need for state checks (escape sequences) and interrupts when an application changes from one language to another or mixes characters from multiple languages
This section provides an overview of the following topics:
Several advantages make it wise to incorporate Unicode into your programming practices.
Because all eDirectory™ strings are stored in Unicode format, applications enabled for eDirectory must use Unicode strings.
eDirectory is increasingly being accepted as an industry standard, providing a rapidly expanding market for eDirectory enabled solutions. All strings and paths in eDirectory are stored in Unicode format, so strings in such solutions must be stored in or convertible to Unicode. This is true for all applications, whether they are designed to be used internationally or not. Using Unicode is also a requirement of applications that take advantage of present or future Novell® services based on eDirectory. Across most eDirectory interfaces, less translation occurs because the strings are already in Unicode.
Unicode simplifies or eliminates many challenges associated with multibyte characters.
Because all Unicode characters are uniformly 16 bits long, Unicode eliminates the need to distinguish between single-byte and double-byte (multibyte) characters. This has at least two advantages:
Moving a pointer from character to character is simply a matter of incrementing or decrementing.
Unicode eliminates the need for special functions, and for precautions in those functions, to prevent landing in the middle of a multibyte character.
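The two advantages above can be illustrated with a minimal sketch. It assumes a 16-bit character type; the names ucs2_t, ucs2_length, and ucs2_prev are hypothetical and are not part of the library. Because every character occupies exactly one 16-bit unit, stepping forward or backward never needs a lead-byte check.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a 16-bit Unicode character type. */
typedef uint16_t ucs2_t;

/* Count characters by stepping one 16-bit unit at a time. */
size_t ucs2_length(const ucs2_t *s)
{
    const ucs2_t *p = s;

    assert(s != NULL);
    while (*p != 0)
        p++;                    /* one increment is one character */
    return (size_t)(p - s);
}

/* Step back one character; no check for the middle of a multibyte
   sequence is needed because every character has the same width. */
const ucs2_t *ucs2_prev(const ucs2_t *start, const ucs2_t *p)
{
    return (p > start) ? p - 1 : start;
}
```

With a multibyte code page, the equivalent of ucs2_prev would have to rescan from the start of the string (or use a special library function) to avoid landing on a trail byte.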
Because all Unicode characters are “in the same set,” Unicode makes it possible to mix characters from widely differing languages that would require separate code pages to represent.
As an industry standard, Unicode increases an application's attractiveness in countries and markets around the world.
The UniGetTable function allows you to get a table pointer to the Unicode translation table corresponding to the local code page indicated in its first argument. However, if you are always translating strings in and out of the underlying host's local code page, you do not need to get a table pointer. You can use a built-in default code page by passing the UNI_LOCAL_DEFAULT flag in place of the table pointer.
Most applications, unless they happen to know that their strings are coming from a foreign locale and are using a different code page, will always want to pass UNI_LOCAL_DEFAULT.
The ability to load tables completely foreign to the host locale makes it possible for an application on a server in New York to translate strings originating from Beijing. This is not a likely scenario because strings coming from another locale would probably not be in multibyte, but rather in UTF-8 or Unicode. Nevertheless, the unilib.h interfaces support all possible multibyte string sources.
When LibC loads, it discovers the identity of the underlying code page and then gets this table pointer or handle. When your VM starts, the library merely initializes the VM's default handle to the one already in force for the entire host. For quick access, the table is already permanently loaded into memory.
Unicode uses preestablished rule tables to map characters from one format to another. If a table does not contain a mapping for a given character, the library interfaces provide the following options for handling this problem:
You specify a character to use as the replacement character for any character not found in the rule table.
You can have the function return an error as soon as it finds an unmappable character.
You can supply your own function to handle the mapping of otherwise unmappable characters whenever the translating function finds such a character. For sample code, see EuroKeep.c.
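The three options above can be sketched in a self-contained form. This is an illustration only, not the library's implementation: the toy rule table maps nothing but 7-bit characters, and every name in it (ucs2_t, nomap_fn, nomap_policy, demo_uni2loc, euro_to_E) is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint16_t ucs2_t;                        /* hypothetical 16-bit character type */
typedef int (*nomap_fn)(ucs2_t ch, char *out);  /* hypothetical callback; 0 on success */

enum nomap_policy { NOMAP_REPLACE, NOMAP_FAIL, NOMAP_CALLBACK };

/* Toy rule table: only 7-bit characters have a mapping. */
static int table_lookup(ucs2_t ch, char *out)
{
    if (ch < 0x80) {
        *out = (char)ch;
        return 0;
    }
    return -1;
}

/* Convert src into dest (destSize includes the terminator).
   Returns 0 on success and -1 on failure. */
int demo_uni2loc(char *dest, size_t destSize, const ucs2_t *src,
                 enum nomap_policy policy, char replacement, nomap_fn fn)
{
    size_t i = 0;

    assert(dest != NULL && src != NULL && destSize > 0);
    for (; *src != 0; src++) {
        char c = 0;

        if (i + 1 >= destSize)
            return -1;                  /* destination too small */
        if (table_lookup(*src, &c) != 0) {
            switch (policy) {
            case NOMAP_REPLACE:         /* option 1: substitute a character */
                c = replacement;
                break;
            case NOMAP_CALLBACK:        /* option 3: ask the caller's function */
                if (fn == NULL || fn(*src, &c) != 0)
                    return -1;
                break;
            default:                    /* option 2: fail immediately */
                return -1;
            }
        }
        dest[i++] = c;
    }
    dest[i] = '\0';
    return 0;
}

/* Example callback: render the euro sign (U+20AC) as 'E'. */
int euro_to_E(ucs2_t ch, char *out)
{
    if (ch == 0x20AC) {
        *out = 'E';
        return 0;
    }
    return -1;
}
```

A callback is the most flexible choice because it can decide per character, but a fixed replacement character is usually enough when the output is only displayed.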
The Unicode interfaces support copying, concatenating, indexing, searching, and comparing Unicode strings. The interfaces also support monocasing (both upper- and lowercasing) as well as weighted comparisons (collation). The interfaces translate among Unicode, UTF-8, and the local code page (ASCII, including multibyte).
The interfaces in the Unicode library have the following functional groupings:
Rule table management functions. For a list, see Unicode Rule Tables.
Translation functions that translate strings. See
Unicode utility functions that perform such string operations as collation, comparison, casing, copying, concatenation, matching, and indexing on Unicode strings. For a list, see Utility Functions.
A Unicode table should be loaded first if you convert a local string on a NetWare host that has a different code page loaded. Any loaded tables should be unloaded on exit of your NLM™ or at the earliest opportunity in order to conserve resources on the NetWare server.
int UniGetTable ( int codePage, UniRuleTable_t *table );
int UniSetDefault ( UniRuleTable_t table );
int UniDisposeTable ( UniRuleTable_t table );
UniRuleTable_t table;
int err;

err = UniGetTable(932, &table);   // Load Japanese Unicode table
if (err == 0)
{
   // Do something useful
   UniDisposeTable(table);        // Unload the table
}
These functions convert local strings to Unicode strings and provide a varying degree of control for unmappable characters.
int loc2uni (
   UniRuleTable_t table,
   unicode_t *dest,
   const char *src,
   unicode_t noMapCh,
   int noMapFlag );
int locn2uni (
   UniRuleTable_t table,
   unicode_t *dest,
   size_t *destLen,
   const char *src,
   size_t srcLen,
   unicode_t noMapCh,
   int noMapFlag );
int locnx2uni (
   UniRuleTable_t table,
   unicode_t *dest,
   size_t *destLen,
   const char *src,
   size_t srcLen,
   Loc2UniNoMapFunc_t *noMapFunc,
   int noMapFuncParm,
   int noMapFlag );
int loc2unipath (
   UniRuleTable_t table,
   unicode_t *dest,
   const char *src,
   size_t *dryRunSize );
if (locn2uni(table, uni, &actSize, sjis, strlen(sjis), 0xFF, UNI_MAP_CHAR))
{
   // conversion failed
}
else
{
   // conversion was successful
}
These functions convert local strings to UTF-8 strings and provide a varying degree of control for unmappable characters.
int loc2utf8 (
   UniRuleTable_t handle,
   char *dest,
   const char *src,
   unicode_t noMapCh,
   int noMapFlag );
int locn2utf8 (
   UniRuleTable_t table,
   char *dest,
   size_t *destLen,
   const char *src,
   size_t srcLen,
   unicode_t noMapCh,
   int noMapFlag );
int locnx2utf8 (
   UniRuleTable_t table,
   char *dest,
   size_t *destLen,
   const char *src,
   size_t srcLen,
   Loc2UniNoMapFunc_t *noMapFunc,
   int noMapFuncParm,
   int noMapFlag );
if (loc2utf8(table, utf8, sjis, 0xFF, UNI_MAP_CHAR))
{
   // conversion failed
}
else
{
   // conversion was successful
}
These functions convert Unicode strings to local code page strings and provide a varying degree of control for unmappable characters.
int uni2loc (
   UniRuleTable_t table,
   char *dest,
   const unicode_t *src,
   char noMapCh,
   int noMapFlag );
int unin2loc (
   UniRuleTable_t table,
   char *dest,
   size_t *destLen,
   const unicode_t *src,
   size_t srcLen,
   char noMapCh,
   int noMapFlag );
int uninx2loc (
   UniRuleTable_t table,
   char *dest,
   size_t *destLen,
   const unicode_t *src,
   size_t srcLen,
   Uni2LocNoMapFunc_t *noMapFunc,
   int noMapFuncParm,
   int noMapFlag );
int uni2locpath (
   UniRuleTable_t table,
   char *dest,
   const unicode_t *src,
   size_t *dryRunSize );
if (unin2loc(table, buf, &buflen, uni, unilen(uni), 0xFF, UNI_MAP_CHAR))
{
   // conversion failed
}
else
{
   // conversion was successful
}
These functions convert Unicode strings to UTF-8 strings and provide a varying degree of control for unmappable characters.
int uni2utf8 ( char *dest, const unicode_t *src );
int unin2utf8 (
   UniRuleTable_t table,
   char *dest,
   size_t *destLen,
   const unicode_t *src,
   size_t srcLen,
   char noMapCh,
   int noMapFlag );
if (uni2utf8(utf8, uni))
{
   // conversion failed
}
else
{
   // conversion was successful
}
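To show what a conversion from 16-bit Unicode to UTF-8 involves, here is a minimal, self-contained sketch of the standard UTF-8 encoding rules for values up to 0xFFFF. It is not the library's implementation. The names ucs2_t and demo_ucs2_to_utf8 are hypothetical, and the sketch treats every 16-bit unit as one character (it does not combine surrogate pairs).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint16_t ucs2_t;   /* hypothetical 16-bit character type */

/* Encode a zero-terminated 16-bit string as UTF-8.
   Returns the number of bytes written (excluding the terminator),
   or (size_t)-1 if dest is too small. */
size_t demo_ucs2_to_utf8(char *dest, size_t destSize, const ucs2_t *src)
{
    size_t n = 0;

    assert(dest != NULL && src != NULL);
    for (; *src != 0; src++) {
        ucs2_t ch = *src;
        size_t need = (ch < 0x80) ? 1 : (ch < 0x800) ? 2 : 3;

        if (n + need + 1 > destSize)    /* keep room for the terminator */
            return (size_t)-1;
        if (need == 1) {                /* 0xxxxxxx */
            dest[n++] = (char)ch;
        } else if (need == 2) {         /* 110xxxxx 10xxxxxx */
            dest[n++] = (char)(0xC0 | (ch >> 6));
            dest[n++] = (char)(0x80 | (ch & 0x3F));
        } else {                        /* 1110xxxx 10xxxxxx 10xxxxxx */
            dest[n++] = (char)(0xE0 | (ch >> 12));
            dest[n++] = (char)(0x80 | ((ch >> 6) & 0x3F));
            dest[n++] = (char)(0x80 | (ch & 0x3F));
        }
    }
    if (n + 1 > destSize)
        return (size_t)-1;
    dest[n] = '\0';
    return n;
}
```

Because one 16-bit character can expand to as many as three bytes, a destination buffer of three bytes per source character plus one for the terminator is always large enough for this sketch.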
These functions convert UTF-8 strings to local strings and provide a varying degree of control for unmappable characters.
int utf82loc (
   UniRuleTable_t handle,
   char *dest,
   const char *src,
   char noMapCh,
   int noMapFlag );
int utf8n2loc (
   UniRuleTable_t table,
   char *dest,
   size_t *destLen,
   const char *src,
   size_t srcLen,
   char noMapCh,
   int noMapFlag );
int utf8nx2loc (
   UniRuleTable_t table,
   char *dest,
   size_t *destLen,
   const char *src,
   size_t srcLen,
   Utf82LocNoMapFunc_t noMapFunc,
   void *noMapFuncParm,
   int noMapFlag );
if (utf82loc(handle, loc, utf8, 0xFF, UNI_MAP_CHAR))
{
   // conversion failed
}
else
{
   // conversion was successful
}
These functions convert UTF-8 strings to Unicode strings and provide a varying degree of control for unmappable characters.
int utf82uni ( unicode_t *dest, const char *src );
int utf8n2uni (
   unicode_t *dest,
   size_t *destLen,
   const char *src,
   size_t srcLen );
if (utf82uni(uni, utf8))
{
   // conversion failed
}
else
{
   // conversion was successful
}
These functions convert from Unicode strings to ASCII strings and back.
char *uni2asc ( char *dest, const unicode_t *src );
char *unin2asc ( char *dest, const unicode_t *src, size_t nchars );
unicode_t *asc2uni ( unicode_t *dest, const char *src );
unicode_t *ascn2uni ( unicode_t *dest, const char *src, size_t nbytes );
buf = asc2uni(uni, asc);
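Conceptually, converting between ASCII and 16-bit Unicode is a simple widening or narrowing of each character, because the first 128 Unicode values coincide with ASCII. The sketch below shows the idea under that assumption. The names ucs2_t, demo_asc2uni, and demo_uni2asc are hypothetical and say nothing about how the library functions are actually implemented.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint16_t ucs2_t;   /* hypothetical 16-bit character type */

/* Widen each 7-bit ASCII character to 16 bits; returns dest. */
ucs2_t *demo_asc2uni(ucs2_t *dest, const char *src)
{
    ucs2_t *d = dest;

    assert(dest != NULL && src != NULL);
    while ((*d++ = (ucs2_t)(unsigned char)*src++) != 0)
        ;                   /* the terminator is copied as well */
    return dest;
}

/* Narrow each character back to char; returns dest.
   Assumes every source character is below 0x80. */
char *demo_uni2asc(char *dest, const ucs2_t *src)
{
    char *d = dest;

    assert(dest != NULL && src != NULL);
    while ((*d++ = (char)*src++) != '\0')
        ;
    return dest;
}
```

Like the prototypes above, both sketch functions return the destination pointer, so a call can be nested inside another expression in the style of strcpy.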
These functions enable you to manipulate Unicode strings. They are not to be preferred over the functions in the wchar.h file.