1.8 Character Conversions

This section contains reference information on character encoding and a description of UTF-8, the encoding used by LDAPv3.

1.8.1 A Brief History of Character Encoding

In the early days of computing, 7-bit ASCII was the standard. The need for more characters drove the creation of a number of 8-bit Single Byte Character Sets (SBCS). ISO-8859, for example, provided the 7-bit ASCII characters and many of the accented characters required for Western Europe.

Asian languages required far more than 256 characters. Multi-byte character sets, such as Shift-JIS and EUC-JP, were developed that use a variable number of bytes per character.

Other encodings appeared that were stateful. They used Shift-In/Shift-Out characters, or escape sequences to switch between encoding schemes.

In an attempt to bring order to this confusion, two separate standards organizations started work on a Universal Character Set (UCS) that would encode all the characters of all the major languages in the world. The two organizations ultimately agreed to maintain a consistent encoding, and the ISO-10646/Unicode standard became widespread. ISO-10646 officially supports a 31-bit code space (0 - 0x7FFFFFFF), while Unicode supports the 21-bit space (0 - 0x10FFFF) of over a million code points. So far no characters have been assigned beyond the 16-bit Basic Multilingual Plane (BMP). While the code point value assigned to each character is well defined, there are different ways that the value may be encoded.

UCS-2 refers to the encoding where each character is a fixed 16-bit length, allowing access to the BMP.

UCS-4 or UTF-32 refers to an encoding where each character is a fixed 32-bit value, allowing direct access to the entire UCS.

UTF-16 is an encoding where a character is one or two 16-bit values, allowing access to the full Unicode code space 0 - 0x10FFFF.
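
For a character beyond the BMP, UTF-16 splits the code point into a pair of 16-bit values called surrogates. The following fragment is offered only as an illustration of the arithmetic (it is not part of any SDK); it computes the surrogate pair for one such code point.

    #include <stdio.h>

    /* Split a code point above 0xFFFF into a UTF-16 surrogate pair. */
    int main(void)
    {
        unsigned long cp = 0x10400UL;       /* a code point outside the BMP    */
        unsigned long v  = cp - 0x10000UL;  /* 20 bits remain after the offset */

        unsigned int high = 0xD800u + (unsigned int)(v >> 10);    /* high surrogate */
        unsigned int low  = 0xDC00u + (unsigned int)(v & 0x3FFu); /* low surrogate  */

        printf("U+%05lX -> 0x%04X 0x%04X\n", cp, high, low);      /* D801 DC00 */
        return 0;
    }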

1.8.2 UTF-8 Encoding

There are a few problems with using the UCS-2, UCS-4, or UTF-16 encodings directly.

  • Since most characters in use today are still from the 7-bit ASCII set, a 16-bit encoding roughly doubles the space required to store the same text.

  • These encodings are incompatible with many existing C APIs, which expect byte-oriented (char *) strings.

  • Byte order (big endian/little endian) is an issue.

  • If data is sent across a byte stream and a byte is dropped, all of the following 16-bit values fall out of alignment, and there is no way to resynchronize.

To address these problems, a byte-oriented form of Unicode was developed, called Unicode Transformation Format 8-bit Encoding (UTF-8). It is a simple algorithmic encoding that maps each 16-bit Unicode character to 1, 2, or 3 bytes; 4 bytes cover the entire 21-bit Unicode space, and up to 6 bytes cover the full 31-bit ISO-10646 address space.
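
The mapping is mechanical: the bits of the code point are distributed across a lead byte and one or more continuation bytes. The following encoder is a minimal sketch of that algorithm (it is illustrative only, not one of the SDK routines described later). For example, it encodes U+00E9 (é) as the two bytes 0xC3 0xA9.

    #include <stddef.h>

    /* Encode one Unicode code point as UTF-8.
     * Writes up to 4 bytes into buf and returns the byte count,
     * or 0 if the value is outside the Unicode range. */
    static size_t utf8_encode(unsigned long cp, unsigned char buf[4])
    {
        if (cp <= 0x7F) {                   /* 1 byte: 7-bit ASCII, unchanged */
            buf[0] = (unsigned char)cp;
            return 1;
        } else if (cp <= 0x7FF) {           /* 2 bytes                        */
            buf[0] = (unsigned char)(0xC0 | (cp >> 6));
            buf[1] = (unsigned char)(0x80 | (cp & 0x3F));
            return 2;
        } else if (cp <= 0xFFFF) {          /* 3 bytes: the rest of the BMP   */
            buf[0] = (unsigned char)(0xE0 | (cp >> 12));
            buf[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            buf[2] = (unsigned char)(0x80 | (cp & 0x3F));
            return 3;
        } else if (cp <= 0x10FFFF) {        /* 4 bytes: beyond the BMP        */
            buf[0] = (unsigned char)(0xF0 | (cp >> 18));
            buf[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
            buf[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            buf[3] = (unsigned char)(0x80 | (cp & 0x3F));
            return 4;
        }
        return 0;                           /* not a valid Unicode code point */
    }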

The greatest advantage is that the encoding for all 7-bit ASCII characters is identical in UTF-8. This solves the wasted space problem nicely, and provides a degree of compatibility with older systems. Byte order is not an issue since it’s a byte stream.

The encoding of UTF-8 also allows unambiguous determination of the start of a character. By examining only the first byte, one can determine the number of bytes in the UTF-8 character sequence. Continuation bytes are easily recognizable, allowing one to detect a missing byte in a stream.
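
Because lead bytes and continuation bytes use distinct bit patterns, both checks can be made by inspecting a single byte. A sketch of the tests (again illustrative, not an SDK API):

    /* Number of bytes in a UTF-8 sequence, judging by its first byte.
     * Returns 0 for a byte that cannot start a sequence. */
    static int utf8_seq_len(unsigned char first)
    {
        if ((first & 0x80) == 0x00) return 1;   /* 0xxxxxxx: ASCII             */
        if ((first & 0xE0) == 0xC0) return 2;   /* 110xxxxx: 2-byte sequence   */
        if ((first & 0xF0) == 0xE0) return 3;   /* 1110xxxx: 3-byte sequence   */
        if ((first & 0xF8) == 0xF0) return 4;   /* 11110xxx: 4-byte sequence   */
        return 0;                               /* 10xxxxxx: continuation byte */
    }

    /* Continuation bytes always have the form 10xxxxxx. */
    static int utf8_is_continuation(unsigned char b)
    {
        return (b & 0xC0) == 0x80;
    }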

RFC 2279 describes the UTF-8 encoding format in detail. Many other resources on the Web, including the Unicode Consortium website, contain more information.

1.8.3 UTF-8 and LDAP

In the LDAP version 2 specification, strings were limited to the T.61 character set, which is basically 7-bit US-ASCII minus several characters (such as tilde, caret, and curly braces). T.61 was a severe limitation for globalization and for efforts to make LDAP a worldwide standard. In LDAP version 3, strings are to be encoded in UTF-8.

Because 7-bit ASCII characters are encoded identically in UTF-8, many applications continue to use local text strings with the LDAP C APIs. This works for ASCII characters, but fails for extended 8-bit characters such as é (e with an acute accent) and for multi-byte Asian characters.

The correct approach is to ensure that all local strings are converted to UTF-8 before they are passed to an LDAP API. Likewise, strings returned from the APIs should be converted back to the local encoding when required, for example before displaying them with printf.
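
As a concrete illustration, the search below spells the filter value Müller with its UTF-8 bytes (0xC3 0xBC for ü) rather than the single Latin-1 byte 0xFC. The host name, port, and base DN are placeholders for this sketch; the calls are standard LDAP C API functions.

    #include <stdio.h>
    #include <ldap.h>

    int main(void)
    {
        /* Host, port, and base DN are placeholders for this sketch. */
        LDAP *ld = ldap_init("ldap.example.com", 389);
        if (ld == NULL)
            return 1;
        ldap_simple_bind_s(ld, NULL, NULL);          /* anonymous bind */

        /* "(cn=M\xC3\xBCller)" is (cn=Müller) in UTF-8: the bytes 0xC3 0xBC
         * encode the character ü.  Passing the single Latin-1 byte 0xFC
         * instead would not be a valid UTF-8 string. */
        char filter[] = "(cn=M\xC3\xBCller)";

        LDAPMessage *res = NULL;
        int rc = ldap_search_s(ld, "ou=people,o=example", LDAP_SCOPE_SUBTREE,
                               filter, NULL, 0, &res);
        if (rc == LDAP_SUCCESS)
            printf("found %d entries\n", ldap_count_entries(ld, res));
        else
            printf("search failed: %s\n", ldap_err2string(rc));

        ldap_msgfree(res);
        ldap_unbind_s(ld);
        return 0;
    }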

Novell’s LDAP C SDK provides routines for converting Unicode strings into UTF-8 strings, in both single-character and whole-string versions. Several UTF-8 string processing routines are also provided, such as UTF-8-aware versions of strchr and strtok, and routines for stepping to the next or previous character.
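
Such routines are needed because the standard byte-oriented string functions count and step through bytes, not characters. The fragment below contrasts the byte count reported by strlen with a character count that steps over whole UTF-8 sequences; it illustrates the idea only (it is not the SDK routine and assumes well-formed UTF-8 input).

    #include <stdio.h>
    #include <string.h>

    /* Count characters (not bytes) in a UTF-8 string by skipping each
     * lead byte together with its continuation bytes. */
    static size_t utf8_strlen(const char *s)
    {
        size_t chars = 0;
        while (*s != '\0') {
            unsigned char b = (unsigned char)*s;
            int len = (b < 0x80) ? 1 :      /* ASCII                 */
                      (b < 0xE0) ? 2 :      /* 2-byte sequence       */
                      (b < 0xF0) ? 3 : 4;   /* 3- or 4-byte sequence */
            s += len;
            chars++;
        }
        return chars;
    }

    int main(void)
    {
        const char *name = "Ren\xC3\xA9";                    /* "René" in UTF-8 */
        printf("strlen:      %u bytes\n", (unsigned)strlen(name));       /* 5 */
        printf("utf8_strlen: %u chars\n", (unsigned)utf8_strlen(name));  /* 4 */
        return 0;
    }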

1.8.4 wchar_t Type

Novell’s SDK conversion routines use the wchar_t type. This type is 2 bytes on some machines and 4 bytes on other machines, so care must be taken if wchar_t strings are transported to other systems. UTF-8 is the most portable way to transfer strings between systems.

wchar_t strings will be either UCS-2 or UCS-4 encoded, depending on the size of wchar_t. The advantage of using wchar_t strings is that all the standard wide-character string processing routines, such as wcslen and wcscat, may be used.
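
The sketch below shows this platform dependence: the same wide string occupies a different number of bytes depending on sizeof(wchar_t), which is why UTF-8, not raw wchar_t data, should be used when strings leave the machine. The conversion step mentioned in the final comment refers to whichever wide-to-UTF-8 routine your SDK release provides; no specific function name is assumed here.

    #include <stdio.h>
    #include <wchar.h>

    int main(void)
    {
        /* Stored as UCS-2 where wchar_t is 2 bytes, UCS-4 where it is 4. */
        const wchar_t *cn = L"Ren\xE9";   /* "René": 0xE9 is é */

        printf("sizeof(wchar_t): %u bytes\n", (unsigned)sizeof(wchar_t));
        printf("characters:      %u\n", (unsigned)wcslen(cn));
        printf("in-memory size:  %u bytes\n",
               (unsigned)(wcslen(cn) * sizeof(wchar_t)));

        /* Before handing cn to an LDAP call, convert it to UTF-8 (for
         * example with the SDK's wide-to-UTF-8 conversion routine or a
         * hand-rolled encoder); never ship raw wchar_t data between
         * systems. */
        return 0;
    }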

In summary, LDAP C applications that make the distinction between local and UTF-8 strings, and handle each properly, will be much easier to internationalize and to move into the global marketplace.