Novell Doc: NDK: Libraries for C (LibC), Volume 1

4.2 Unicode

Unicode is a standard for character representation designed to accommodate every character in every language that is likely to be used in any computer application, including alphabetic, ideographic, and symbolic characters. Developed by the companies that collectively constitute the Unicode Consortium, the standard uses a numbering system similar to that of ASCII but differs from it in some fundamental ways. Most importantly, Unicode as used here allots 16 bits to each character (UCS-2 encoding). This feature has several positive results:

- Up to 65,536 characters can be represented, enough for every character of nearly every language in use today.
- Unicode eliminates the need for state checks (escape sequences) and interrupts when an application changes from one language to another or mixes characters from multiple languages.

This section provides an overview of the following topics:

- Why Use Unicode
- Rule Tables
- Unmappable Characters
- Function Overview

4.2.1 Why Use Unicode

Several advantages make it wise to incorporate Unicode into your programming practices.

- Because all eDirectory™ strings are stored in Unicode format, applications enabled for eDirectory must use Unicode strings.

  eDirectory is increasingly being accepted as an industry standard, providing a rapidly expanding market for eDirectory enabled solutions. All strings and paths in eDirectory are stored in Unicode format, so strings in such solutions must be stored in or convertible to Unicode. This is true for all applications, whether they are designed to be used internationally or not. Using Unicode is also a requirement of applications that take advantage of present or future Novell® services based on eDirectory. Across most eDirectory interfaces, less translation occurs because the strings are already in Unicode.

- Unicode simplifies or eliminates many challenges associated with multibyte characters.

  Because all Unicode characters are uniformly 16 bits long, Unicode eliminates the need to distinguish between single-byte and double-byte (multibyte) characters. This has at least two advantages:
  - Moving a pointer from character to character is simply a matter of incrementing or decrementing.
  - Unicode eliminates the need for special functions, and for precautions in those functions, to prevent landing in the middle of a multibyte character.

- Because all Unicode characters are “in the same set,” Unicode makes it possible to mix characters from widely differing languages that would otherwise require separate code pages to represent.
- As an industry standard, Unicode increases an application's attractiveness in countries and markets around the world.
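
The pointer-stepping advantage can be illustrated with a minimal sketch in standard C. It does not use the library: ucs2_t is a hypothetical stand-in for unicode_t, and ucs2_count and sjis_count are illustrative names. The Shift-JIS lead-byte ranges (0x81-0x9F and 0xE0-0xFC) are used only to show the extra check that a double-byte code page requires.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint16_t ucs2_t;   /* hypothetical stand-in for the library's unicode_t */

/* Count characters in a NUL-terminated UCS-2 string.
   Every character is one 16-bit unit, so each step is a plain increment. */
size_t ucs2_count(const ucs2_t *s)
{
    size_t n = 0;

    assert(s != NULL);
    while (*s++ != 0)
        n++;
    return n;
}

/* Shift-JIS lead bytes fall in 0x81-0x9F and 0xE0-0xFC. */
static int sjis_is_lead(unsigned char c)
{
    return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
}

/* Count characters in a NUL-terminated Shift-JIS string.
   Each step must inspect the byte to decide whether to advance one or two. */
size_t sjis_count(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t n = 0;

    assert(s != NULL);
    while (*p != 0) {
        if (sjis_is_lead(*p) && p[1] != 0)
            p += 2;            /* double-byte character */
        else
            p += 1;            /* single-byte character */
        n++;
    }
    return n;
}
```

With the UCS-2 string there is no way to land in the middle of a character; with the Shift-JIS string, stepping backward or indexing at an arbitrary byte offset would need further precautions that this sketch does not attempt.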

4.2.2 Rule Tables

The UniGetTable function allows you to get a table pointer to the Unicode translation table corresponding to the local code page indicated in its first argument. However, if you are always translating strings in and out of the underlying host's local code page, you do not need to get a table pointer. You can use a built-in default code page by passing the UNI_LOCAL_DEFAULT flag in place of the table pointer.

Most applications, unless they happen to know that their strings are coming from a foreign locale and are using a different code page, will always want to pass UNI_LOCAL_DEFAULT.

The ability to load tables completely foreign to the host locale makes it possible for an application on a server in New York to translate strings originating from Beijing. This is not a likely scenario because strings coming from another locale would probably not be in a multibyte encoding, but rather in UTF-8 or Unicode. Nevertheless, the unilib.h interfaces support all possible multibyte string sources.

When LibC loads, it discovers the identity of the underlying code page and then gets this table pointer or handle. When your VM starts, the library merely initializes the calling VM's default handle to the one already in force for the entire host. For quick access, the table is already permanently loaded into memory.

4.2.3 Unmappable Characters

Unicode uses preestablished rule tables to map characters from one format to another. If a table does not contain a mapping for a given character, the library interfaces provide the following options for handling this problem:

- You can specify a character to use as the replacement character for any character not found in the rule table.
- You can have the function return an error as soon as it finds an unmappable character.
- You can supply your own function to handle the mapping of otherwise unmappable characters whenever the translating function finds such a character. For sample code, see EuroKeep.c.
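
The three options can be sketched in standard C against a toy rule table. This is not the library's implementation and is not taken from EuroKeep.c: ucs2_t, the nomap_mode values, nomap_fn, toy_lookup, toy_uni2loc, and euro_handler are all hypothetical names, and the toy table maps only 7-bit ASCII so that everything else counts as unmappable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint16_t ucs2_t;          /* hypothetical stand-in for unicode_t */

enum nomap_mode {                 /* hypothetical policy names */
    NOMAP_REPLACE,                /* substitute a fixed character */
    NOMAP_FAIL,                   /* stop and report an error */
    NOMAP_CALLBACK                /* ask the caller what to emit */
};

/* Caller-supplied handler: returns the byte to emit for an unmappable unit. */
typedef char (*nomap_fn)(ucs2_t ch, void *ctx);

/* Toy rule table: only 7-bit ASCII maps; everything else is unmappable. */
static int toy_lookup(ucs2_t ch, char *out)
{
    if (ch < 0x80) {
        *out = (char)ch;
        return 1;
    }
    return 0;
}

/* Convert a NUL-terminated UCS-2 string to bytes under the given policy.
   Returns 0 on success, or -1 if an unmappable character is met in
   NOMAP_FAIL mode or if dest is too small. */
int toy_uni2loc(char *dest, size_t destSize, const ucs2_t *src,
                enum nomap_mode mode, char repl, nomap_fn fn, void *ctx)
{
    size_t i = 0;

    assert(dest != NULL && src != NULL && destSize > 0);
    for (; *src != 0; src++) {
        char c;

        if (!toy_lookup(*src, &c)) {
            if (mode == NOMAP_FAIL)
                return -1;
            if (mode == NOMAP_CALLBACK && fn != NULL)
                c = fn(*src, ctx);
            else
                c = repl;
        }
        if (i + 1 >= destSize)    /* keep room for the terminator */
            return -1;
        dest[i++] = c;
    }
    dest[i] = '\0';
    return 0;
}

/* Example handler: emit 'E' for the euro sign, '?' for anything else. */
char euro_handler(ucs2_t ch, void *ctx)
{
    (void)ctx;
    return ch == 0x20AC ? 'E' : '?';
}
```

Converting the two-unit string { 0x20AC, '5' } yields "?5" with NOMAP_REPLACE and a '?' replacement, fails with NOMAP_FAIL, and yields "E5" with NOMAP_CALLBACK and euro_handler.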

4.2.4 Function Overview

The Unicode interfaces support copying, concatenating, indexing, searching, and comparing Unicode strings. They also support monocasing (both uppercasing and lowercasing) and weighted comparisons (collation). The interfaces translate among Unicode, UTF-8, and the local code page (ASCII, including multibyte).

The interfaces in the Unicode library have the following functional groupings:

- Rule table management functions. For a list, see Unicode Rule Tables.
- Translation functions that translate strings. See
- Unicode utility functions that perform such string operations as collation, comparison, casing, copying, concatenation, matching, and indexing on Unicode strings. For a list, see Utility Functions.

Unicode Rule Tables

A Unicode table should be loaded first if you convert a local string on a NetWare host that has a different code page loaded. Any loaded tables should be unloaded on exit of your NLM™ or at the earliest opportunity in order to conserve resources on the NetWare server.

Specification:

  int UniGetTable ( int codePage, UniRuleTable_t *table );
  int UniSetDefault ( UniRuleTable_t table );
  int UniDisposeTable( UniRuleTable_t table );

Sample code:

  UniRuleTable_t table;
  int            err;
  
  err = UniGetTable(932, &table); // Load Japanese (code page 932) Unicode table
  
  if (err == 0) // Assumes 0 indicates success, as in the conversion samples
  {
    // Do something useful
  
    UniDisposeTable(table); // Unload the table
  }

From Local to Unicode

These functions convert local strings to Unicode strings and provide a varying degree of control for unmappable characters.

Specification:

  int loc2uni ( UniRuleTable_t table, unicode_t *dest, 
                const char *src, unicode_t noMapCh, 
                int noMapFlag );
  
  int locn2uni ( UniRuleTable_t table, unicode_t *dest, 
                 size_t *destLen, const char *src, 
                 size_t srcLen, unicode_t noMapCh, 
                 int noMapFlag );
  
  int locnx2uni ( UniRuleTable_t table, unicode_t *dest, 
                  size_t *destLen, const char *src, 
                  size_t srcLen, Loc2UniNoMapFunc_t *noMapFunc,
                  int noMapFuncParm, int noMapFlag );
  
  int loc2unipath ( UniRuleTable_t table, unicode_t *dest,
                    const char *src, size_t *dryRunSize );

Sample code:

  if (locn2uni(table, uni, &actSize, sjis, strlen(sjis), 0xFF,
      UNI_MAP_CHAR))
  {
  // conversion failed
  }
  else
  {
  // conversion was successful
  }

From Local to UTF-8

These functions convert local strings to UTF-8 strings and provide a varying degree of control for unmappable characters.

Specification:

  int loc2utf8 ( UniRuleTable_t handle, char *dest, 
                 const char *src, unicode_t noMapCh, 
                 int noMapFlag );
  
  int locn2utf8 ( UniRuleTable_t table, char *dest, 
                  size_t *destLen, const char *src, 
                  size_t srcLen, unicode_t noMapCh,
                  int noMapFlag );
  
  int locnx2utf8 ( UniRuleTable_t table, char *dest, 
                   size_t *destLen,
                   const char *src, size_t srcLen,
                   Loc2UniNoMapFunc_t *noMapFunc,
                   int noMapFuncParm, int noMapFlag );

Sample code:

  if (loc2utf8(table, utf8, sjis, 0xFF, UNI_MAP_CHAR))
  {
  // conversion failed
  }
  else
  {
  // conversion was successful
  }

From Unicode to Local

These functions convert Unicode strings to local code page strings and provide a varying degree of control for unmappable characters.

Specification:

  int uni2loc ( UniRuleTable_t table, char *dest, 
                const unicode_t *src, char noMapCh, 
                int noMapFlag );
  
  int unin2loc ( UniRuleTable_t table, char *dest, 
                 size_t *destLen, const unicode_t *src, 
                 size_t srcLen, char noMapCh,
                 int noMapFlag );
  
  int uninx2loc ( UniRuleTable_t table, char *dest, 
                  size_t *destLen, const unicode_t *src, 
                  size_t srcLen, Uni2LocNoMapFunc_t *noMapFunc,
                  int noMapFuncParm, int noMapFlag );
  
  int uni2locpath ( UniRuleTable_t table, char *dest, 
                    const unicode_t *src, size_t *dryRunSize );

Sample code:

  if (unin2loc(table, buf, &buflen, uni, unilen(uni), 0xFF,
      UNI_MAP_CHAR))
  {
  // conversion failed
  }
  else
  {
  // conversion was successful
  }

From Unicode to UTF-8

These functions convert Unicode strings to UTF-8 strings and provide a varying degree of control for unmappable characters.

Specification:

  int uni2utf8 ( char *dest, const unicode_t *src );
  
  int unin2utf8 ( UniRuleTable_t table, char *dest, 
                  size_t *destLen, const unicode_t *src, 
                  size_t srcLen, char noMapCh,
                  int noMapFlag );

Sample code:

  if (uni2utf8(utf8, uni))
  {
  // conversion failed
  }
  else
  {
  // conversion was successful
  }
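
What the Unicode-to-UTF-8 translation involves can be sketched in standard C. This is not the library's code: ucs2_t is a hypothetical stand-in for unicode_t and ucs2_to_utf8 is an illustrative name. The sketch assumes UCS-2 input, so each 16-bit unit becomes one, two, or three bytes and surrogate pairs are not combined.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint16_t ucs2_t;   /* hypothetical stand-in for unicode_t */

/* Encode a NUL-terminated UCS-2 string as NUL-terminated UTF-8.
   Returns the number of bytes written, excluding the terminator,
   or (size_t)-1 if dest is too small. */
size_t ucs2_to_utf8(char *dest, size_t destSize, const ucs2_t *src)
{
    size_t i = 0;

    assert(dest != NULL && src != NULL);
    for (; *src != 0; src++) {
        ucs2_t ch = *src;
        size_t need = ch < 0x80 ? 1 : ch < 0x800 ? 2 : 3;

        if (i + need >= destSize)         /* keep room for the terminator */
            return (size_t)-1;
        if (need == 1) {
            dest[i++] = (char)ch;                           /* 0xxxxxxx */
        } else if (need == 2) {
            dest[i++] = (char)(0xC0 | (ch >> 6));           /* 110xxxxx */
            dest[i++] = (char)(0x80 | (ch & 0x3F));         /* 10xxxxxx */
        } else {
            dest[i++] = (char)(0xE0 | (ch >> 12));          /* 1110xxxx */
            dest[i++] = (char)(0x80 | ((ch >> 6) & 0x3F));  /* 10xxxxxx */
            dest[i++] = (char)(0x80 | (ch & 0x3F));         /* 10xxxxxx */
        }
    }
    if (i >= destSize)
        return (size_t)-1;
    dest[i] = '\0';
    return i;
}
```

For example, U+00E9 encodes as the two bytes C3 A9, and the euro sign U+20AC encodes as the three bytes E2 82 AC.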

From UTF-8 to Local

These functions convert UTF-8 strings to local strings and provide a varying degree of control for unmappable characters.

Specification:

  int utf82loc ( UniRuleTable_t handle, char *dest, 
                 const char *src, char noMapCh, 
                 int noMapFlag );
  
  int utf8n2loc ( UniRuleTable_t table, char *dest, 
                   size_t *destLen, const char *src, 
                   size_t srcLen, char noMapCh,
                   int noMapFlag );
  
  int utf8nx2loc ( UniRuleTable_t table, char *dest, 
                   size_t *destLen, const char *src, size_t srcLen,
                   Utf82LocNoMapFunc_t noMapFunc, 
                   void *noMapFuncParm, int noMapFlag );

Sample code:

  if (utf82loc(handle, loc, utf8, 0xFF, UNI_MAP_CHAR))
  {
  // conversion failed
  }
  else
  {
  // conversion was successful
  }

From UTF-8 to Unicode

These functions convert UTF-8 strings to Unicode strings. As the specification shows, neither function takes arguments for handling unmappable characters.

Specification:

  int utf82uni ( unicode_t *dest, const char *src );
  
  int utf8n2uni ( unicode_t *dest, size_t *destLen, 
                  const char *src, size_t srcLen );

Sample code:

  if (utf82uni(uni, utf8))
  {
  // conversion failed
  }
  else
  {
  // conversion was successful
  }
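
The reverse direction can be sketched the same way. Again this is standard C rather than the library's code: ucs2_t and utf8_to_ucs2 are hypothetical names. The sketch assumes a UCS-2 destination, so it rejects four-byte UTF-8 sequences (code points above U+FFFF) along with malformed or overlong input. How the library itself reports such input is not covered here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint16_t ucs2_t;   /* hypothetical stand-in for unicode_t */

/* Decode NUL-terminated UTF-8 into NUL-terminated UCS-2.
   Returns the number of 16-bit units written, excluding the terminator,
   or (size_t)-1 on malformed input, a code point above U+FFFF,
   or a destination that is too small. */
size_t utf8_to_ucs2(ucs2_t *dest, size_t destUnits, const char *src)
{
    const unsigned char *p = (const unsigned char *)src;
    size_t n = 0;

    assert(dest != NULL && src != NULL);
    while (*p != 0) {
        unsigned int ch;

        if (p[0] < 0x80) {                                  /* 1 byte */
            ch = p[0];
            p += 1;
        } else if ((p[0] & 0xE0) == 0xC0 && (p[1] & 0xC0) == 0x80) {
            ch = ((p[0] & 0x1Fu) << 6) | (p[1] & 0x3Fu);    /* 2 bytes */
            if (ch < 0x80)
                return (size_t)-1;                          /* overlong */
            p += 2;
        } else if ((p[0] & 0xF0) == 0xE0 && (p[1] & 0xC0) == 0x80
                   && (p[2] & 0xC0) == 0x80) {
            ch = ((p[0] & 0x0Fu) << 12) | ((p[1] & 0x3Fu) << 6)
                 | (p[2] & 0x3Fu);                          /* 3 bytes */
            if (ch < 0x800)
                return (size_t)-1;                          /* overlong */
            p += 3;
        } else {
            return (size_t)-1;   /* stray byte, truncated or 4-byte sequence */
        }
        if (n + 1 >= destUnits)  /* keep room for the terminator */
            return (size_t)-1;
        dest[n++] = (ucs2_t)ch;
    }
    if (n >= destUnits)
        return (size_t)-1;
    dest[n] = 0;
    return n;
}
```

The continuation-byte checks short-circuit, so a sequence cut off by the terminating NUL is rejected without reading past the end of the string.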

From/To Unicode and ASCII

These functions convert from Unicode strings to ASCII strings and back.

Specification:

  char *uni2asc ( char *dest, const unicode_t *src );
  
  char *unin2asc ( char *dest, const unicode_t *src, 
                   size_t nchars );
  
  unicode_t *asc2uni ( unicode_t *dest, const char *src );
  
  unicode_t *ascn2uni ( unicode_t *dest, const char *src, 
                        size_t nbytes );

Sample code:

  buf = asc2uni(uni, asc);
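
Because ASCII occupies the first 128 Unicode values, converting between the two is a matter of widening or narrowing each character. The following standard C sketch shows the idea with hypothetical names (ucs2_t, ascii_widen, ascii_narrow). It mirrors the pointer-returning shape of the functions above but is not their implementation. Replacing non-ASCII units with '?' when narrowing is an assumption of the sketch, not documented library behavior.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint16_t ucs2_t;   /* hypothetical stand-in for unicode_t */

/* Widen a NUL-terminated ASCII string to UCS-2 and return dest.
   The caller must supply room for strlen(src) + 1 units. */
ucs2_t *ascii_widen(ucs2_t *dest, const char *src)
{
    ucs2_t *d = dest;

    assert(dest != NULL && src != NULL);
    while (*src != '\0')
        *d++ = (ucs2_t)(unsigned char)*src++;
    *d = 0;
    return dest;
}

/* Narrow a NUL-terminated UCS-2 string to bytes and return dest.
   Units above 0x7F have no ASCII equivalent and become '?'
   (an assumption made for this sketch). */
char *ascii_narrow(char *dest, const ucs2_t *src)
{
    char *d = dest;

    assert(dest != NULL && src != NULL);
    for (; *src != 0; src++)
        *d++ = (*src < 0x80) ? (char)*src : '?';
    *d = '\0';
    return dest;
}
```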

Utility Functions

These functions enable you to manipulate Unicode strings. They are not to be preferred over the functions in the wchar.h file.

Classification:

  UniClass_t unitype ( unicode_t ch );

Collation:

  int unicoll ( const unicode_t *s1, const unicode_t *s2 );
  int unincoll ( const unicode_t *s1, const unicode_t *s2, size_t n );

Casing:

  UniCase_t unicase ( unicode_t ch );
  unicode_t *uni2mono ( unicode_t *dest, const unicode_t *src, UniCase_t casing );
  unicode_t chr2upr ( unicode_t ch );
  unicode_t chr2lwr ( unicode_t ch );
  unicode_t chr2title ( unicode_t ch );
  unicode_t *unilwr ( unicode_t *string );
  unicode_t *uniupr ( unicode_t *string );
  unicode_t *uni2lwr ( unicode_t *dest, const unicode_t *src );
  unicode_t *uni2upr ( unicode_t *dest, const unicode_t *src );
  unicode_t *uni2title ( unicode_t *dest, const unicode_t *src );

Length:

  size_t unilen ( const unicode_t *string );
  size_t uninlen ( const unicode_t *string, size_t max );
  size_t unisize ( const unicode_t *string );

Copy:

  unicode_t *unicpy ( unicode_t *tgt, const unicode_t *src );
  unicode_t *unincpy ( unicode_t *tgt, const unicode_t *src, size_t n );
  unicode_t *uniset ( unicode_t *base, unicode_t ch );
  unicode_t *uninset ( unicode_t *base, unicode_t ch, size_t n );

Concatenation:

  unicode_t *unicat ( unicode_t *tgt, const unicode_t *src );
  unicode_t *unincat ( unicode_t *tgt, const unicode_t *src, size_t n );
  unicode_t *unilist ( unicode_t *tgt, const unicode_t *s1, ... );

Comparison:

  int unicmp ( const unicode_t *s1, const unicode_t *s2 );
  int uniicmp ( const unicode_t *s1, const unicode_t *s2 );
  int unincmp ( const unicode_t *s1, const unicode_t *s2, size_t n );
  int uninicmp ( const unicode_t *s1, const unicode_t *s2, size_t n );

Character matching, indexing, and miscellaneous:

  unicode_t *unichr ( const unicode_t *string, unicode_t ch );
  unicode_t *unirchr ( const unicode_t *string, unicode_t ch );
  unicode_t *uniindex ( const unicode_t *string, const unicode_t *search );
  unicode_t *unistr ( const unicode_t *as1, const unicode_t *as2 );
  unicode_t *unirev ( unicode_t *base );
  size_t unispn ( const unicode_t *string, const unicode_t *charset );
  size_t unicspn ( const unicode_t *string, const unicode_t *charset );
  unicode_t *unipbrk ( const unicode_t *s1, const unicode_t *s2 );
  unicode_t *unitok ( unicode_t *string, const unicode_t *sepset );
  unicode_t *unitok_r ( unicode_t *string, const unicode_t *sepset, unicode_t **lasts );
  unicode_t *unidup ( const unicode_t *s1 );

Converted string size:

  int LocToUniSize ( UniRuleTable_t table, const char *str, size_t unmappedCharSize, int noMapFlag, size_t *uniBufSize );
  int UniToLocSize ( UniRuleTable_t table, const unicode_t *str, size_t unmappedCharSize, int noMapFlag, size_t *locBufSize );
  int LocToUtf8Size ( UniRuleTable_t table, const char *str, size_t unmappedCharSize, int noMapFlag, size_t *utf8BufSize );
  int UniToUtf8Size ( const unicode_t *str, size_t *utf8BufSize );
  int Utf8ToLocSize ( UniRuleTable_t table, const char *str, size_t unmappedCharSize, int noMapFlag, size_t *locBufSize );
  int Utf8ToUniSize ( const char *str, size_t *uniBufSize );
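
When an exact size is not needed, a destination buffer can also be sized from worst-case bounds. The sketch below is plain C arithmetic with hypothetical helper names. It does not call the size functions above and makes no claim about the units in which they report sizes. It assumes UCS-2 (one 16-bit unit per character) and assumes that every local character, whether one byte or two, maps to a single 16-bit unit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint16_t ucs2_t;   /* hypothetical stand-in for unicode_t */

/* Local code page -> UCS-2: at most one 16-bit unit per source byte,
   plus one unit for the terminator. Result is in bytes. */
size_t max_ucs2_bytes_from_local(size_t localBytes)
{
    assert(localBytes < SIZE_MAX / sizeof(ucs2_t));
    return (localBytes + 1) * sizeof(ucs2_t);
}

/* UCS-2 -> UTF-8: each 16-bit unit encodes to at most three bytes,
   plus one byte for the terminator. Result is in bytes. */
size_t max_utf8_bytes_from_ucs2(size_t units)
{
    assert(units <= (SIZE_MAX - 1) / 3);
    return units * 3 + 1;
}

/* UTF-8 -> UCS-2: at most one 16-bit unit per source byte,
   plus one unit for the terminator. Result is in bytes. */
size_t max_ucs2_bytes_from_utf8(size_t utf8Bytes)
{
    assert(utf8Bytes < SIZE_MAX / sizeof(ucs2_t));
    return (utf8Bytes + 1) * sizeof(ucs2_t);
}
```

These bounds can overallocate considerably for mostly-ASCII text, which is the trade-off against making an exact sizing pass first.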