
Getting a Grip on UTF-8 and Case in LDIF

Novell Cool Solutions: Tip
By Koen Verheyen


Posted: 13 Mar 2003

The Question

What is the difference between caseIgnoreMatch and caseIgnoreIA5Match when creating attributes in the schema via LDIF?

I did some digging and found the following:

caseIgnoreMatch
case insensitive, space insensitive
directoryString UTF-8 string

caseIgnoreIA5Match
case insensitive, space insensitive
IA5String ASCII string

So my interpretation is that they are very similar matching rules but for different types of data: caseIgnoreIA5Match is for ASCII string data types and caseIgnoreMatch is for UTF-8 string data types.

So I searched on ASCII and UTF-8 and it did not get any clearer. It appears that these are also similar, judging from this description in the Linux man page:

UTF-8 - an ASCII compatible multibyte Unicode encoding

I cannot seem to find the doc I am looking for: one that explains what the difference is between these two strings and when you would use one over the other. If I get that, I may be able to answer the client's questions.

The Scoop

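ASCII is a 7-bit character encoding, normally stored one character per byte, that covers only 128 characters. To include the characters used in different parts of the world, many different 8-bit "code pages" exist, each assigning its own characters to the upper 128 byte values. But this causes trouble when storing multilingual information, because the same byte means different things under different code pages. Hence a new character encoding was required to overcome that problem.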
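To make the code page problem concrete, here is a small Python sketch. Python and the three code pages chosen are assumptions for illustration only; they do not come from the original tip. It decodes the same single byte under three different code pages and gets three different characters.

```python
# One byte above the 7-bit ASCII range, interpreted under three code pages.
raw = b"\xe9"

decoded = {
    "latin-1": raw.decode("latin-1"),      # Western European
    "cp1251": raw.decode("cp1251"),        # Cyrillic (Windows)
    "iso8859_7": raw.decode("iso8859_7"),  # Greek
}

for code_page, char in decoded.items():
    print(f"{code_page:10} -> {char!r} (U+{ord(char):04X})")
```

A stored byte alone does not say which code page it came from, which is why mixing languages in one data store was error-prone before Unicode.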

When encoded in UTF-8 (described below), text in most Western European languages requires less than two bytes per character. For example, characters from Latin-based scripts require only 1.1 bytes on average. Greek, Arabic, Hebrew, and Russian require an average of 1.7 bytes. Finally, Japanese, Korean, and Chinese typically require three bytes per character.
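These averages can be checked roughly with a few lines of Python. The helper name avg_utf8_bytes and the sample words are hypothetical, chosen only for this sketch; single words will not reproduce the averages exactly, because running text also contains spaces, digits, and punctuation that take one byte each.

```python
def avg_utf8_bytes(text: str) -> float:
    """Average number of UTF-8 bytes per character in text."""
    return len(text.encode("utf-8")) / len(text)

# Sample words are illustrative only.
samples = {
    "English": "directory",
    "German": "Straße",
    "Russian": "каталог",
    "Japanese": "日本語",
}

for language, word in samples.items():
    print(f"{language:9} {avg_utf8_bytes(word):.2f} bytes/char")
```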

The big idea of Unicode (and how it got its name) was to have one universal character set instead of many, with every character assigned a single number (its code point) that can be stored and manipulated as a 32-bit integer. The UTF (Unicode Transformation Format) encodings define how those code points are written out as bytes. But convincing the world to suddenly use a 32-bit-wide character encoding is not easy because of all the legacy systems. So you need a compromise: UTF-8.

UTF-8 is a multibyte encoding in which each character is encoded in as few as one byte and as many as four bytes. Characters in the range U+0000 - U+007F are encoded as a single byte. This means that the ASCII character set can be represented unchanged with a single byte of storage per character. The range U+0800 - U+FFFF, which takes three bytes per character, includes Chinese, Korean, and Japanese characters, among other scripts.

UTF-8 Bit Encoding of a Unicode Code Point:

Character Range       Bit Encoding
U+0000 - U+007F       0xxxxxxx
U+0080 - U+07FF       110xxxxx 10xxxxxx
U+0800 - U+FFFF       1110xxxx 10xxxxxx 10xxxxxx
U+10000 - U+10FFFF    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
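The table can be turned into code almost line for line. The following is a minimal sketch, with the function name utf8_encode being hypothetical; in practice you would simply call Python's built-in str.encode("utf-8"), which the sketch uses to check itself.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point by following the bit-layout table."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded in UTF-8")
    if code_point <= 0x7F:                      # 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:                     # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:                    # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    return bytes([0xF0 | (code_point >> 18),    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])

# One sample character per row of the table, checked against the built-in codec.
for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    encoded = utf8_encode(ord(ch))
    assert encoded == ch.encode("utf-8")
    print(f"U+{ord(ch):04X}", " ".join(f"{b:08b}" for b in encoded))
```

Every byte of a multibyte sequence has its high bit set, so such bytes can never be mistaken for ASCII characters.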

The power of UTF-8 is that it bridges between 32-bit Unicode and single-byte ASCII.

Other UTF versions are UTF-16 and UTF-32, which store characters in 16-bit or 32-bit units (UTF-16 uses one or two units per character, UTF-32 always one) and hence are not compatible with ASCII.
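A short Python comparison shows the compatibility difference. The sample string "cn=admin" is just an illustrative value, and the little-endian codec variants are used here only to avoid a byte-order mark in the output.

```python
text = "cn=admin"  # illustrative ASCII-only value

utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")  # little-endian, no byte-order mark
utf32 = text.encode("utf-32-le")

# UTF-8 output is byte-for-byte identical to the ASCII encoding.
print(utf8 == text.encode("ascii"))

# UTF-16 and UTF-32 pad each ASCII character with zero bytes.
print(len(utf8), len(utf16), len(utf32))
print(utf16[:4])

# A code point above U+FFFF needs two 16-bit units (a surrogate pair) in UTF-16.
emoji_units = len("\U0001F600".encode("utf-16-le")) // 2
print(emoji_units)
```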

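So, the two matching rules would behave the same if the data matched is pure ASCII. But when the data is UTF-8 encoded and some characters are multibyte, ASCII matching no longer works. Use caseIgnoreIA5Match for attributes with the IA5String (ASCII) syntax and caseIgnoreMatch for attributes with the directoryString (UTF-8) syntax.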
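The effect can be sketched in Python. This is not how eDirectory or any other LDAP server implements the matching rules; the function names ascii_only_match and unicode_match are hypothetical, and the sketch only assumes that "space insensitive" means ignoring leading, trailing, and repeated spaces.

```python
def ascii_only_match(a: bytes, b: bytes) -> bool:
    """Case- and space-insensitive compare that only understands ASCII."""
    # bytes.lower() converts only A-Z; bytes >= 0x80 are left untouched.
    return b" ".join(a.split()).lower() == b" ".join(b.split()).lower()

def unicode_match(a: str, b: str) -> bool:
    """Case- and space-insensitive compare on decoded Unicode text."""
    return " ".join(a.split()).casefold() == " ".join(b.split()).casefold()

# Pure ASCII data: both approaches agree.
print(ascii_only_match(b"John  Smith", b"john smith"))
print(unicode_match("John  Smith", "john smith"))

# Data with a multibyte character: only the Unicode-aware compare matches.
upper = "\u00c4rger"  # "Ärger" with a precomposed capital A-umlaut
lower = "\u00e4rger"  # "ärger"
print(ascii_only_match(upper.encode("utf-8"), lower.encode("utf-8")))
print(unicode_match(upper, lower))
```

The ASCII-only compare fails on the second pair because the two umlaut characters are stored as different two-byte sequences that a byte-level lowercase operation does not relate to each other.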
