4.2 Unicode

Unicode is a standard for character representation designed to accommodate every character in every language that is likely to be used in any computer application. Representation includes alphabetic, ideographic, and symbolic characters. Developed by the companies that collectively constitute the Unicode Consortium, the standard uses a numbering system similar to that of ASCII, but has some fundamental differences. Most importantly, in the UCS-2 encoding used by these interfaces, every character occupies 16 bits. This feature has several positive results, which are described in Section 4.2.1.

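This section provides an overview of why to use Unicode, how rule tables work, how unmappable characters are handled, and which functions the Unicode library provides.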

4.2.1 Why Use Unicode

Several advantages make it wise to incorporate Unicode into your programming practices.

  • Because all eDirectory™ strings are stored in Unicode format, applications enabled for eDirectory must use Unicode strings.

    eDirectory is increasingly being accepted as an industry standard, providing a rapidly expanding market for eDirectory-enabled solutions. All strings and paths in eDirectory are stored in Unicode format, so strings in such solutions must be stored in or convertible to Unicode. This is true for all applications, whether or not they are designed to be used internationally. Using Unicode is also a requirement for applications that take advantage of present or future Novell® services based on eDirectory. Across most eDirectory interfaces, less translation occurs because the strings are already in Unicode.

  • Unicode simplifies or eliminates many challenges associated with multibyte characters.

    Because all Unicode characters are uniformly 16 bits long, Unicode eliminates the need to distinguish between single-byte and double-byte (multibyte) characters. This has at least two advantages:

    • Moving a pointer from character to character is simply a matter of incrementing or decrementing.

    • Unicode eliminates the need for special functions, and for precautions in those functions, to prevent landing in the middle of a multibyte character.

  • Because all Unicode characters are “in the same set,” Unicode makes it possible to mix characters from widely differing languages that would require separate code pages to represent.

  • As an industry standard, Unicode increases an application's attractiveness in countries and markets around the world.
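
The pointer-stepping advantage above can be illustrated with a minimal sketch. It does not use the library; `ucs2_t`, `ucs2_count`, and `ucs2_prev` are hypothetical names introduced only for this example, and `ucs2_t` merely stands in for a 16-bit character type.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a 16-bit Unicode character (assumption). */
typedef uint16_t ucs2_t;

/* Count the characters in a NUL-terminated UCS-2 string by stepping one
 * element at a time. No lead-byte or trail-byte checks are needed. */
size_t ucs2_count(const ucs2_t *s)
{
    const ucs2_t *p = s;

    assert(s != NULL);
    while (*p != 0)
        p++;
    return (size_t)(p - s);
}

/* Return a pointer to the character before p, or NULL if p is already at
 * the start. With fixed-width characters this is a simple decrement. */
const ucs2_t *ucs2_prev(const ucs2_t *start, const ucs2_t *p)
{
    return (p > start) ? p - 1 : NULL;
}
```

In a multibyte code page, stepping backward safely generally means rescanning from a known character boundary, which is the precaution the fixed width removes.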

4.2.2 Rule Tables

The UniGetTable function allows you to get a table pointer to the Unicode translation table corresponding to the local code page indicated in its first argument. However, if you are always translating strings in and out of the underlying host's local code page, you do not need to get a table pointer. You can use a built-in default code page by passing the UNI_LOCAL_DEFAULT flag in place of the table pointer.

Most applications, unless they happen to know that their strings are coming from a foreign locale and are using a different code page, will always want to pass UNI_LOCAL_DEFAULT.

The ability to load tables completely foreign to the host locale makes it possible for an application on a server in New York to translate strings originating from Beijing. This is not a likely scenario because strings coming from another locale would probably not be in a multibyte code page, but rather in UTF-8 or Unicode. Nevertheless, the unilib.h interfaces support all possible multibyte string sources.

When LibC loads, it discovers the identity of the underlying code page and then gets this table pointer or handle. When your VM starts, the library merely initializes the calling VM's default handle to the one already in force for the entire host. For quick access, the table is already permanently loaded into memory.

4.2.3 Unmappable Characters

Unicode uses preestablished rule tables to map characters from one format to another. If a table does not contain a mapping for a given character, the library interfaces provide the following options for handling this problem:

  • You can specify a character to use as the replacement character for any character not found in the rule table.

  • You can have the function return an error as soon as it finds an unmappable character.

  • You can supply your own function to handle the mapping of otherwise unmappable characters whenever the translating function finds such a character. For sample code, see EuroKeep.c.
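
The three options above can be sketched with a toy converter. This is not the library's implementation and is not taken from EuroKeep.c. Every name here (`ucs2_t`, `nomap_fn`, `nomap_policy`, `toy_lookup`, `toy_uni2loc`, `euro_to_e`) is hypothetical, and the "rule table" is reduced to an assumption that only code points below 0x80 have a local mapping.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint16_t ucs2_t;               /* hypothetical 16-bit character */
typedef char (*nomap_fn)(ucs2_t ch);   /* hypothetical callback type    */

enum nomap_policy { NOMAP_REPLACE, NOMAP_FAIL, NOMAP_CALLBACK };

/* Toy "rule table": only code points below 0x80 map to a local byte. */
static int toy_lookup(ucs2_t ch, char *out)
{
    if (ch < 0x80) {
        *out = (char)ch;
        return 1;
    }
    return 0;
}

/* Convert src into dest (capacity cap bytes, including the terminator).
 * Returns 0 on success, -1 if an unmappable character is found under
 * NOMAP_FAIL, and -2 if dest is too small. */
int toy_uni2loc(char *dest, size_t cap, const ucs2_t *src,
                enum nomap_policy policy, char repl, nomap_fn fn)
{
    size_t n = 0;

    assert(dest != NULL && src != NULL);
    if (cap == 0)
        return -2;
    for (; *src != 0; src++) {
        char c;

        if (!toy_lookup(*src, &c)) {
            if (policy == NOMAP_FAIL)
                return -1;                       /* option 2: stop with an error */
            if (policy == NOMAP_CALLBACK && fn != NULL)
                c = fn(*src);                    /* option 3: ask the caller     */
            else
                c = repl;                        /* option 1: substitute         */
        }
        if (n + 1 >= cap)
            return -2;
        dest[n++] = c;
    }
    dest[n] = '\0';
    return 0;
}

/* Example callback: render the euro sign as 'E' and anything else as '?'. */
char euro_to_e(ucs2_t ch)
{
    return (ch == 0x20AC) ? 'E' : '?';
}
```

The callback form is the most flexible of the three because the caller decides, character by character, what to emit.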

4.2.4 Function Overview

The Unicode interfaces support copying, concatenating, indexing, searching, and comparing Unicode strings. The interfaces also support monocasing (both uppercasing and lowercasing) as well as weighted comparisons (collation). The interfaces translate among Unicode, UTF-8, and the local code page (ASCII, including multibyte).

The interfaces in the Unicode library have the following functional groupings:

Unicode Rule Tables

A Unicode table should be loaded first if you convert a local string on a NetWare host that has a different code page loaded. Any loaded tables should be unloaded on exit of your NLM™ or at the earliest opportunity in order to conserve resources on the NetWare server.

Specification:

  int UniGetTable ( int codePage, UniRuleTable_t *table );
  int UniSetDefault ( UniRuleTable_t table );
  int UniDisposeTable( UniRuleTable_t table );
  

Sample code:

  UniRuleTable_t table;
  int err;
  
  err = UniGetTable(932, &table); // Load the Japanese (Shift-JIS) table
  if (err == 0) // assumes 0 indicates success
  {
     // Do something useful
  
     UniDisposeTable(table); // Unload the table
  }
  

From Local to Unicode

These functions convert local strings to Unicode strings and provide a varying degree of control for unmappable characters.

Specification:

  int loc2uni ( UniRuleTable_t table, unicode_t *dest, 
                const char *src, unicode_t noMapCh, 
                int noMapFlag );
  
  int locn2uni ( UniRuleTable_t table, unicode_t *dest, 
                 size_t *destLen, const char *src, 
                 size_t srcLen, unicode_t noMapCh, 
                 int noMapFlag );
  
  int locnx2uni ( UniRuleTable_t table, unicode_t *dest, 
                  size_t *destLen, const char *src, 
                  size_t srcLen, Loc2UniNoMapFunc_t *noMapFunc,
                  int noMapFuncParm, int noMapFlag );
  
  int loc2unipath ( UniRuleTable_t table, unicode_t *dest,
                    const char *src, size_t *dryRunSize );
  

Sample code:

  if (locn2uni(table, uni, &actSize, sjis, strlen(sjis), 0xFF,
      UNI_MAP_CHAR))
  {
  // conversion failed
  }
  else
  {
  // conversion was successful
  }
  

From Local to UTF-8

These functions convert local strings to UTF-8 strings and provide a varying degree of control for unmappable characters.

Specification:

  int loc2utf8 ( UniRuleTable_t handle, char *dest, 
                 const char *src, unicode_t noMapCh, 
                 int noMapFlag );
  
  int locn2utf8 ( UniRuleTable_t table, char *dest, 
                  size_t *destLen, const char *src, 
                  size_t srcLen, unicode_t noMapCh,
                  int noMapFlag );
  
  int locnx2utf8 ( UniRuleTable_t table, char *dest, 
                   size_t *destLen,
                   const char *src, size_t srcLen,
                   Loc2UniNoMapFunc_t *noMapFunc,
                   int noMapFuncParm, int noMapFlag );
  
  

Sample code:

  if (loc2utf8(table, utf8, sjis, 0xFF, UNI_MAP_CHAR))
  {
  // conversion failed
  }
  else
  {
  // conversion was successful
  }
  

From Unicode to Local

These functions convert Unicode strings to local code page strings and provide a varying degree of control for unmappable characters.

Specification:

  int uni2loc ( UniRuleTable_t table, char *dest, 
                const unicode_t *src, char noMapCh, 
                int noMapFlag );
  
  int unin2loc ( UniRuleTable_t table, char *dest, 
                 size_t *destLen, const unicode_t *src, 
                 size_t srcLen, char noMapCh,
                 int noMapFlag );
  
  int uninx2loc ( UniRuleTable_t table, char *dest, 
                  size_t *destLen, const unicode_t *src, 
                  size_t srcLen, Uni2LocNoMapFunc_t *noMapFunc,
                  int noMapFuncParm, int noMapFlag );
  
  int uni2locpath ( UniRuleTable_t table, char *dest, 
                    const unicode_t *src, size_t *dryRunSize );
  

Sample code:

  if (unin2loc(table, buf, &buflen, uni, unilen(uni), 0xFF,
      UNI_MAP_CHAR))
  {
  // conversion failed
  }
  else
  {
  // conversion was successful
  }
  

From Unicode to UTF-8

These functions convert Unicode strings to UTF-8 strings and provide a varying degree of control for unmappable characters.

Specification:

  int uni2utf8 ( char *dest, const unicode_t *src );
  
  int unin2utf8 ( UniRuleTable_t table, char *dest, 
                  size_t *destLen, const unicode_t *src, 
                  size_t srcLen, char noMapCh,
                  int noMapFlag );
  

Sample code:

  if (uni2utf8(utf8, uni))
  {
  // conversion failed
  }
  else
  {
  // conversion was successful
  }
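
To show what a conversion of this kind involves, here is a minimal, portable sketch of encoding 16-bit characters as UTF-8. It is not the library's code. `ucs2_t` and `ucs2_to_utf8` are hypothetical names, and the explicit capacity argument and return convention are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint16_t ucs2_t;   /* hypothetical stand-in for a 16-bit character */

/* Encode a NUL-terminated UCS-2 string as UTF-8.
 * Returns the number of bytes written, excluding the terminator, or
 * (size_t)-1 if dest (capacity cap bytes) is too small.
 * Surrogate pairs are not combined; each 16-bit unit is encoded alone. */
size_t ucs2_to_utf8(char *dest, size_t cap, const ucs2_t *src)
{
    size_t n = 0;

    assert(dest != NULL && src != NULL);
    for (; *src != 0; src++) {
        ucs2_t ch = *src;
        size_t need = (ch < 0x80) ? 1 : (ch < 0x800) ? 2 : 3;

        if (n + need >= cap)          /* keep one byte for the terminator */
            return (size_t)-1;
        if (need == 1) {
            dest[n++] = (char)ch;
        } else if (need == 2) {
            dest[n++] = (char)(0xC0 | (ch >> 6));
            dest[n++] = (char)(0x80 | (ch & 0x3F));
        } else {
            dest[n++] = (char)(0xE0 | (ch >> 12));
            dest[n++] = (char)(0x80 | ((ch >> 6) & 0x3F));
            dest[n++] = (char)(0x80 | (ch & 0x3F));
        }
    }
    if (n >= cap)
        return (size_t)-1;
    dest[n] = '\0';
    return n;
}
```

Because a 16-bit character can need up to three UTF-8 bytes, a destination buffer of three bytes per source character plus one for the terminator is always large enough for this sketch.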
  

From UTF-8 to Local

These functions convert UTF-8 strings to local strings and provide a varying degree of control for unmappable characters.

Specification:

  int utf82loc ( UniRuleTable_t handle, char *dest, 
                 const char *src, char noMapCh, 
                 int noMapFlag );
  
  int utf8n2loc ( UniRuleTable_t table, char *dest, 
                   size_t *destLen, const char *src, 
                   size_t srcLen, char noMapCh,
                   int noMapFlag );
  
  int utf8nx2loc ( UniRuleTable_t table, char *dest, 
                   size_t *destLen, const char *src, size_t srcLen,
                   Utf82LocNoMapFunc_t noMapFunc, 
                   void *noMapFuncParm, int noMapFlag );
  

Sample code:

  if (utf82loc(handle, loc, utf8, 0xFF, UNI_MAP_CHAR))
  {
  // conversion failed
  }
  else
  {
  // conversion was successful
  }
  

From UTF-8 to Unicode

These functions convert UTF-8 strings to Unicode strings and provide a varying degree of control for unmappable characters.

Specification:

  int utf82uni ( unicode_t *dest, const char *src );
  
  int utf8n2uni ( unicode_t *dest, size_t *destLen, 
                  const char *src, size_t srcLen );
  

Sample code:

  if (utf82uni(uni, utf8))
  {
  // conversion failed
  }
  else
  {
  // conversion was successful
  }
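
The reverse direction can be sketched the same way. The decoder below is a portable illustration, not the library's implementation. `ucs2_t` and `utf8_to_ucs2` are hypothetical names, and rejecting four-byte sequences (which do not fit in 16 bits) is a choice made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint16_t ucs2_t;   /* hypothetical stand-in for a 16-bit character */

/* Decode a NUL-terminated UTF-8 string into UCS-2.
 * Returns the number of characters written, excluding the terminator, or
 * (size_t)-1 on malformed input, on a sequence longer than three bytes,
 * or if dest (capacity cap characters) is too small.
 * Overlong encodings are not rejected in this sketch. */
size_t utf8_to_ucs2(ucs2_t *dest, size_t cap, const char *src)
{
    const unsigned char *p = (const unsigned char *)src;
    size_t n = 0;

    assert(dest != NULL && src != NULL);
    while (*p != 0) {
        uint32_t ch;
        int extra;

        if (*p < 0x80)                { ch = *p;        extra = 0; }
        else if ((*p & 0xE0) == 0xC0) { ch = *p & 0x1F; extra = 1; }
        else if ((*p & 0xF0) == 0xE0) { ch = *p & 0x0F; extra = 2; }
        else return (size_t)-1;       /* stray continuation or 4-byte form */
        p++;
        while (extra-- > 0) {
            if ((*p & 0xC0) != 0x80)  /* also catches a truncated string */
                return (size_t)-1;
            ch = (ch << 6) | (uint32_t)(*p & 0x3F);
            p++;
        }
        if (n + 1 >= cap)             /* keep one slot for the terminator */
            return (size_t)-1;
        dest[n++] = (ucs2_t)ch;
    }
    if (n >= cap)
        return (size_t)-1;
    dest[n] = 0;
    return n;
}
```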
  

From/To Unicode and ASCII

These functions convert from Unicode strings to ASCII strings and back.

Specification:

  char *uni2asc ( char *dest, const unicode_t *src );
  
  char *unin2asc ( char *dest, const unicode_t *src, 
                   size_t nchars );
  
  unicode_t *asc2uni ( unicode_t *dest, const char *src );
  
  unicode_t *ascn2uni ( unicode_t *dest, const char *src, 
                        size_t nbytes );
  

Sample code:

  buf = asc2uni(uni, asc);
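
Converting between ASCII and 16-bit characters amounts to widening or narrowing each character. The sketch below shows the idea only. `ucs2_t`, `ascii_widen`, and `ascii_narrow` are hypothetical names, and substituting '?' for non-ASCII characters when narrowing is an assumption of this example, not documented library behavior.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint16_t ucs2_t;   /* hypothetical stand-in for a 16-bit character */

/* Widen a NUL-terminated ASCII string into dest, which must have room for
 * every character plus the terminator. Returns dest. */
ucs2_t *ascii_widen(ucs2_t *dest, const char *src)
{
    ucs2_t *d = dest;

    assert(dest != NULL && src != NULL);
    do {
        *d++ = (ucs2_t)(unsigned char)*src;
    } while (*src++ != '\0');
    return dest;
}

/* Narrow a NUL-terminated 16-bit string into dest. Characters outside the
 * ASCII range are written as '?' (a choice made for this sketch).
 * Returns dest. */
char *ascii_narrow(char *dest, const ucs2_t *src)
{
    char *d = dest;

    assert(dest != NULL && src != NULL);
    for (; *src != 0; src++)
        *d++ = (*src < 0x80) ? (char)*src : '?';
    *d = '\0';
    return dest;
}
```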
  

Utility Functions

These functions enable you to manipulate Unicode strings. Where an equivalent exists in the wchar.h file, prefer the wchar.h function.

The functions are grouped by task.

Classification:

  UniClass_t unitype ( unicode_t ch );

Collation:

  int unicoll ( const unicode_t *s1, const unicode_t *s2 );
  int unincoll ( const unicode_t *s1, const unicode_t *s2, size_t n );

Casing:

  UniCase_t unicase ( unicode_t ch );
  unicode_t *uni2mono ( unicode_t *dest, const unicode_t *src,
                        UniCase_t casing );
  unicode_t chr2upr ( unicode_t ch );
  unicode_t chr2lwr ( unicode_t ch );
  unicode_t chr2title ( unicode_t ch );
  unicode_t *unilwr ( unicode_t *string );
  unicode_t *uniupr ( unicode_t *string );
  unicode_t *uni2lwr ( unicode_t *dest, const unicode_t *src );
  unicode_t *uni2upr ( unicode_t *dest, const unicode_t *src );
  unicode_t *uni2title ( unicode_t *dest, const unicode_t *src );

Length:

  size_t unilen ( const unicode_t *string );
  size_t uninlen ( const unicode_t *string, size_t max );
  size_t unisize ( const unicode_t *string );

Copy:

  unicode_t *unicpy ( unicode_t *tgt, const unicode_t *src );
  unicode_t *unincpy ( unicode_t *tgt, const unicode_t *src, size_t n );
  unicode_t *uniset ( unicode_t *base, unicode_t ch );
  unicode_t *uninset ( unicode_t *base, unicode_t ch, size_t n );

Concatenation:

  unicode_t *unicat ( unicode_t *tgt, const unicode_t *src );
  unicode_t *unincat ( unicode_t *tgt, const unicode_t *src, size_t n );
  unicode_t *unilist ( unicode_t *tgt, const unicode_t *s1, ... );

Comparison:

  int unicmp ( const unicode_t *s1, const unicode_t *s2 );
  int uniicmp ( const unicode_t *s1, const unicode_t *s2 );
  int unincmp ( const unicode_t *s1, const unicode_t *s2, size_t n );
  int uninicmp ( const unicode_t *s1, const unicode_t *s2, size_t n );

Character matching, indexing, and miscellaneous:

  unicode_t *unichr ( const unicode_t *string, unicode_t ch );
  unicode_t *unirchr ( const unicode_t *string, unicode_t ch );
  unicode_t *uniindex ( const unicode_t *string, const unicode_t *search );
  unicode_t *unistr ( const unicode_t *as1, const unicode_t *as2 );
  unicode_t *unirev ( unicode_t *base );
  size_t unispn ( const unicode_t *string, const unicode_t *charset );
  size_t unicspn ( const unicode_t *string, const unicode_t *charset );
  unicode_t *unipbrk ( const unicode_t *s1, const unicode_t *s2 );
  unicode_t *unitok ( unicode_t *string, const unicode_t *sepset );
  unicode_t *unitok_r ( unicode_t *string, const unicode_t *sepset,
                        unicode_t **lasts );
  unicode_t *unidup ( const unicode_t *s1 );

Converted string size:

  int LocToUniSize ( UniRuleTable_t table, const char *str,
                     size_t unmappedCharSize, int noMapFlag,
                     size_t *uniBufSize );
  int UniToLocSize ( UniRuleTable_t table, const unicode_t *str,
                     size_t unmappedCharSize, int noMapFlag,
                     size_t *locBufSize );
  int LocToUtf8Size ( UniRuleTable_t table, const char *str,
                      size_t unmappedCharSize, int noMapFlag,
                      size_t *utf8BufSize );
  int UniToUtf8Size ( const unicode_t *str, size_t *utf8BufSize );
  int Utf8ToLocSize ( UniRuleTable_t table, const char *str,
                      size_t unmappedCharSize, int noMapFlag,
                      size_t *locBufSize );
  int Utf8ToUniSize ( const char *str, size_t *uniBufSize );
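
Because the wchar.h functions are preferred where they apply, the following minimal sketch uses only standard wchar.h calls (`wcslen`, `wcscpy`, `wcscat`). The helper `join_wide` is a hypothetical name introduced for this example. Whether `wchar_t` has the same width as `unicode_t` depends on the platform, so treat any one-to-one correspondence with the functions listed above as an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Join a and b with sep between them into dest, whose capacity is cap
 * wide characters including the terminator.
 * Returns dest, or NULL if the result would not fit. */
wchar_t *join_wide(wchar_t *dest, size_t cap, const wchar_t *a,
                   const wchar_t *sep, const wchar_t *b)
{
    assert(dest != NULL && a != NULL && sep != NULL && b != NULL);
    if (wcslen(a) + wcslen(sep) + wcslen(b) + 1 > cap)
        return NULL;
    wcscpy(dest, a);      /* copy */
    wcscat(dest, sep);    /* concatenate */
    wcscat(dest, b);
    return dest;
}
```

Checking the combined length before copying keeps the unbounded `wcscpy` and `wcscat` calls within the destination buffer.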