1.6 Unicode Converter

The Unicode converter supports the following features:

1.6.1 Standard and Extended Unicode Converter Functions

The newer Unicode converter API contains two sets of functions: "standard" and "extended".

"Standard" functions, which begin with NWUS*, provide a simple interface using the platform’s native country and code page. They follow a default conversion behavior (for example, how an unmappable character is handled) that is not subject to adjustment by the developer.

"Extended" functions, which begin with NWUX*, allow the country code and code page to be specified, and allow extensive control over conversion options.

Standard Unicode Converter Functions

These functions allow for conversions between byte and Unicode strings and between different kinds of Unicode strings. These standard functions use only the converters supplied by Novell, and do not allow anything other than the default behavior.

Function                      Header File  Description
NWUSByteToUnicode             nunicode.h   Converts a NULL-terminated byte string into a Unicode string.
NWUSByteToUnicodePath         nunicode.h   Converts a NULL-terminated file path byte string into a Unicode string.
NWUSGetCodePage               nunicode.h   Returns the code page used to specify which converters are loaded.
NWUSLenByteToUnicode          nunicode.h   Converts a length-specified byte string into a NULL-terminated Unicode string.
NWUSLenUnicodeToByte          nunicode.h   Converts a length-specified Unicode string into a NULL-terminated byte string.
NWUSStandardUnicodeInit       nunicode.h   Loads the converters needed for the standard Unicode functions.
NWUSStandardUnicodeRelease    nunicode.h   Releases all resources allocated by the standard converter.
NWUSUnicodeToByte             nunicode.h   Converts a Unicode string into a NULL-terminated byte string.
NWUSUnicodeToBytePath         nunicode.h   Converts a file path Unicode string into a NULL-terminated byte string.
NWUSUnicodeToUntermByte       nunicode.h   Converts a Unicode string into an unterminated byte string.
NWUSUnicodeToUntermBytePath   nunicode.h   Converts a file path Unicode string into an unterminated byte string.
NWUSUnicodeToLowerCase        nunicode.h   Converts a NULL-terminated Unicode string to Unicode lower case characters.
NWUSUnicodeToUpperCase        nunicode.h   Converts a NULL-terminated Unicode string to Unicode upper case characters.

Extended Unicode Converter Functions

The following functions offer more flexibility and choices than the standard Unicode functions offer.

Function                      Header File  Description
NWUXByteToUnicode             nunicode.h   Converts a NULL-terminated byte string into a Unicode string.
NWUXByteToUnicodePath         nunicode.h   Converts a NULL-terminated file path byte string into a Unicode string.
NWUXGetByteFunctions          nunicode.h   Returns the functions used for handling unmappable bytes and special byte sequences during byte-to-Unicode conversion.
NWUXGetCharSize               nunicode.h   Returns the character size (1 or 2) of the next character in the byte string.
NWUXGetNoMapAction            nunicode.h   Returns the actions to follow when an unmappable byte sequence and an unmappable Unicode character are found.
NWUXGetScanAction             nunicode.h   Gets the status of current scan/parse functions.
NWUXGetUniFunctions           nunicode.h   Returns the functions used for handling unmappable Unicode characters and special Unicode sequences during Unicode-to-byte conversion.
NWUXGetSubByte                nunicode.h   Returns the substitution byte for the converter pointed to.
NWUXGetSubUni                 nunicode.h   Returns the current substitution Unicode character for the converter.
NWUXLenByteToUnicode          nunicode.h   Converts a length-specified byte string into a NULL-terminated Unicode string.
NWUXLenUnicodeToByte          nunicode.h   Converts a length-specified Unicode string into a NULL-terminated byte string.
NWUXLoadByteUnicodeConverter  nunicode.h   Locates and loads a converter to convert between Unicode and the specified code page.
NWUXLoadCaseConverter         nunicode.h   Locates and loads a converter to convert Unicode to upper, lower, or title case (upper case for initial letter only).
NWUXResetConverter            nunicode.h   Resets the converter to a default state.
NWUXSetNoMapAction            nunicode.h   Sets the actions to follow when an unmappable byte or an unmappable Unicode character is found.
NWUXSetByteFunctions          nunicode.h   Specifies the functions to be used to handle unmappable bytes and special byte sequences during byte-to-Unicode conversion.
NWUXSetScanAction             nunicode.h   Enables or disables the current scan/parse functions.
NWUXSetSubByte                nunicode.h   Specifies the substitution byte for the converter.
NWUXSetSubUni                 nunicode.h   Specifies the substitution character for the converter.
NWUXSetUniFunctions           nunicode.h   Specifies the functions to be used to handle unmappable Unicode characters and special Unicode sequences during Unicode-to-byte conversion.
NWUXUnicodeToByte             nunicode.h   Converts a Unicode string into a NULL-terminated byte string.
NWUXUnicodeToBytePath         nunicode.h   Converts a Unicode file path string into a NULL-terminated byte string.
NWUXUnicodeToUntermByte       nunicode.h   Converts a Unicode string into an unterminated byte string.
NWUXUnicodeToUntermBytePath   nunicode.h   Converts a Unicode file path string into an unterminated byte string.
NWUXUnicodeToCase             nunicode.h   Converts a NULL-terminated Unicode string to upper case, lower case, or title case, depending on the converter pointed to.
NWUXUnloadConverter           nunicode.h   Unloads a converter and releases all associated resources.

1.6.2 Unicode Converter Implementation

Unicode Converter is based on a set of converters implemented as DLLs. These converter files may be placed in any directory where the system searches for DLLs (for example, C:\WINDOWS\SYSTEM).

Converter files follow a naming convention that designates both the converter type and the supported platform. The format of that convention is UNI_[TYP].[PLT], where [TYP] is the converter type and [PLT] is the supported platform. Extensions to designate supported platforms are as follows:

  • .W32: Windows 95 and NT
  • .NLM: NLM platform

There are four types of converters. All illustrations that follow use ".W32", the extension for Windows 95 and NT.

  • Byte/Unicode converters convert both from byte to Unicode and from Unicode to byte. The [TYP] component of the converter file name is the number specifying the desired code page.

    For example, UNI_1252.W32 is the converter for code page 1252 (W95/NT).

  • Case converters convert cases in one Unicode string to different cases in another Unicode string. The [TYP] component of the converter file name is MON for lowercasing, UPR for uppercasing, and TTL for titlecasing (first letter of each word capitalized). For example:
    • UNI_MON.W32: Lowercasing
    • UNI_UPR.W32: Uppercasing
    • UNI_TTL.W32: Titlecasing
  • Collation converters collate Unicode strings according to the collation conventions of a specified country. The [TYP] component of the converter file name is the letter "C" followed by the country code of the specified country.

    For example, UNI_C1.W32 is the collation converter for country code 1 (US).

  • Normalization converters convert to precomposed or decomposed Unicode characters. The [TYP] component of the converter file name is PRE for converting to precomposed characters and DEC for converting to decomposed characters:
    • UNI_PRE.W32: Precomposing converter
    • UNI_DEC.W32: Decomposing converter
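
The naming convention above can be illustrated with a short sketch. The helper names below are hypothetical and are not part of the Unicode Converter API; they only assemble a file name of the form UNI_[TYP].[PLT] from the type and platform components described in this section.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not an NWUS or NWUX function): builds a converter
 * file name of the form UNI_<TYP>.<PLT>.
 * Returns 0 on success, -1 on a NULL argument or a too-small buffer. */
int sketch_converter_file_name(char *out, size_t outSize,
                               const char *typ, const char *plt)
{
    if (out == NULL || typ == NULL || plt == NULL)
        return -1;

    /* "UNI_" (4) + typ + "." (1) + plt + trailing zero (1) */
    if (strlen(typ) + strlen(plt) + 6 > outSize)
        return -1;

    sprintf(out, "UNI_%s.%s", typ, plt);
    return 0;
}

/* Hypothetical helper: collation converters use the letter "C" followed
 * by the country code as the [TYP] component. */
int sketch_collation_file_name(char *out, size_t outSize,
                               unsigned countryCode, const char *plt)
{
    char typ[16];

    sprintf(typ, "C%u", countryCode);
    return sketch_converter_file_name(out, outSize, typ, plt);
}
```

Under these assumptions, the components "1252" and "W32" give UNI_1252.W32, and country code 1 with "W32" gives UNI_C1.W32, matching the examples in the list.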

When an "extended" converter is opened, a handle is returned which is used in subsequent calls to extended converter functions. The developer may change various options for a particular converter without affecting other extended converters.

In contrast, once the standard converter is opened, it may be used by any number of programs. The developer cannot change preset standard converter options.

Default Conversion Behavior

For Unicode-to-byte and byte-to-Unicode conversion, the following behavior is automatic for standard functions and is default for extended functions. Standard functions provide for this behavior only, but extended functions allow extensive modification.

Unicode-to-byte Conversion

Unmappable Unicode characters result in a call to a function handler, which forms the basis of lossless round trip conversion. The handler converts each unmappable Unicode character into a string of six byte characters as follows:

  • Unmappable Unicode character U+NNNN becomes byte string "[NNNN]".

For example, if the character "#" is an unmappable Unicode "skull and crossbones" character (U+2620),

  • the Unicode input string "abc#def"
  • converts to the local byte output string "abc[2620]def".
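
The following is a minimal sketch of the "[NNNN]" handler behavior, not the converter's actual implementation. It assumes that only 7-bit ASCII is mappable (a real converter consults the loaded code page) and that hexadecimal letters are written in upper case; the function name is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Sketch assumption: a character is "mappable" only if it is 7-bit ASCII. */
static int sketch_is_mappable(uint16_t ch)
{
    return ch < 0x80;
}

/* Hypothetical stand-in for the default Unicode-to-byte behavior.
 * Converts a zero-terminated string of 16-bit Unicode values; each
 * unmappable character U+NNNN becomes the six bytes "[NNNN]".
 * Returns the number of bytes written (excluding the trailing zero),
 * or -1 if the output buffer is too small. */
long sketch_unicode_to_byte(char *out, size_t outSize, const uint16_t *in)
{
    size_t used = 0;

    for (; *in != 0; in++) {
        if (sketch_is_mappable(*in)) {
            if (used + 1 >= outSize)
                return -1;
            out[used++] = (char)*in;
        } else {
            char esc[7];

            if (used + 6 >= outSize)
                return -1;
            sprintf(esc, "[%04X]", (unsigned)*in);
            memcpy(out + used, esc, 6);
            used += 6;
        }
    }
    if (used >= outSize)
        return -1;
    out[used] = '\0';
    return (long)used;
}
```

With the input "abc", U+2620, "def", this sketch produces the 12-byte string "abc[2620]def".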

Scan/parse functions are disabled.

Byte-to-Unicode Conversion

Unmappable byte characters result in a substitution by the standard Unicode REPLACEMENT CHARACTER (U+FFFD).

The scan/parse functions are enabled, reversing the Unicode-to-byte function handler behavior. These scan/parse functions scan for the byte sequence "[NNNN]", where NNNN is a string of four hexadecimal digits. They convert each such sequence to a single Unicode character whose value is U+NNNN. For example, if the character "#" were again the Unicode "skull and crossbones" character (U+2620),

  • the local byte input string "abc[2620]def"
  • converts to the Unicode output string "abc#def".
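
The scan/parse step can be sketched as follows. This is an illustration under stated assumptions, not the system-supplied scan and parse functions: the function names are hypothetical, bytes below 0x80 are treated as the only mappable bytes, and any other byte is replaced with U+FFFD.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 and stores the value if p points at "[NNNN]" with four
 * hexadecimal digits; returns 0 otherwise. Reading stops at the first
 * non-matching byte, so the trailing zero is never passed. */
static int sketch_parse_escape(const char *p, uint16_t *value)
{
    unsigned v = 0;
    int i;

    if (p[0] != '[')
        return 0;
    for (i = 1; i <= 4; i++) {
        unsigned char c = (unsigned char)p[i];

        if (!isxdigit(c))
            return 0;
        v = v * 16 + (unsigned)(isdigit(c) ? c - '0' : toupper(c) - 'A' + 10);
    }
    if (p[5] != ']')
        return 0;
    *value = (uint16_t)v;
    return 1;
}

/* Hypothetical stand-in for the default byte-to-Unicode behavior.
 * Returns the number of Unicode values written (excluding the trailing
 * zero), or -1 if the output buffer is too small. */
long sketch_byte_to_unicode(uint16_t *out, size_t outLen, const char *in)
{
    size_t used = 0;

    while (*in != '\0') {
        uint16_t ch;

        if (used + 1 >= outLen)
            return -1;
        if (sketch_parse_escape(in, &ch)) {
            in += 6;                        /* consume "[NNNN]" */
        } else {
            unsigned char b = (unsigned char)*in++;

            ch = (uint16_t)((b < 0x80) ? b : 0xFFFD);
        }
        out[used++] = ch;
    }
    if (used >= outLen)
        return -1;
    out[used] = 0;
    return (long)used;
}
```

A bracketed sequence that does not contain exactly four hexadecimal digits, such as "[12]", is copied through unchanged in this sketch.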

Conversion Control

The standard converter allows only for Default Conversion Behavior, and the byte/Unicode converters use that behavior as a starting default. However, the extended functions let you set options other than the system defaults by calling NWUXSetByteFunctions, NWUXSetUniFunctions, and NWUXSetNoMapAction.

Lossless Conversion

Many Unicode characters cannot be represented in a given local code page. However, situations arise when a Unicode string is converted to a local byte string, then converted back to Unicode. With the former Unicode API, any unmappable characters were lost in this process. The Unicode Converter API functions provide the capability to convert from Unicode to local and back to Unicode without losing any information.

Supported Code Pages

Section 1.4, Supported Code Pages, shows the code pages supported by Novell.

1.6.3 Conversion Operations

Although a standard Unicode converter behaves in fundamentally the same way on every platform this API supports, access to converters by any specific application can vary.

NetWare: Global variables are global to the entire system. A standard converter initialized with NWUSStandardUnicodeInit or NWUSStandardUnicodeOverride is the only standard converter available to any application on the platform. If the converter is changed through a call to NWUSStandardUnicodeOverride, the change also affects any application requesting standard conversions.

Windows 95, Windows 98, and Windows NT: Global variables are global only to a single process. Each process in which a thread calls NWUSStandardUnicodeInit or NWUSStandardUnicodeOverride gets its own copy of the global variables associated with the standard converter. It is therefore possible for one process to perform Unicode conversions with the system default code page converter and for another process to perform conversions with an explicitly specified converter. Each process that calls a standard converter must also release the converter and its associated resources with a corresponding call to NWUSStandardUnicodeRelease.

For related information, see Unicode Converter Implementation.

Location of Converter DLLs

Converters in the Unicode Converter API set are installed during the installation process.

NWUSStandardUnicodeInit automatically loads a byte/Unicode converter for the native system code page, an uppercase converter, and a lowercase converter.

One of the NWUXLoad... functions must be called to load an extended converter. The load functions return a separate handle to each converter loaded. That handle must be passed to any other extended converter functions involving the respective converter. Each NWUXLoad... function called should be followed with a corresponding call to NWUXUnloadConverter when the converter is no longer needed.

Initializing/Loading Unicode Converters

NWUSStandardUnicodeInit must be called before using any of the standard converter functions. Each call to NWUSStandardUnicodeInit should have a corresponding call to NWUSStandardUnicodeRelease when the conversion operations are complete.

NWUSStandardUnicodeInit automatically loads converters for the following kinds of conversions:

  • Byte-to-Unicode and Unicode-to-byte conversions between the Unicode set and the native system code page
  • Conversions of Unicode strings to all upper case
  • Conversions of Unicode strings to all lower case

Other kinds of conversions require one or more extended converters and the functions in the extended (NWUX...) set.

Each of the extended converters is loaded with a separate NWUXLoad... function, and each such call returns a handle that is specific to the converter loaded. That handle is then passed to any other extended functions that require the respective converter. When an extended converter is no longer needed, it should be unloaded with a call to NWUXUnloadConverter.

Unterminated Byte Strings from Unicode Conversion

Four functions in this Unicode API conversion set provide for unterminated byte string output from Unicode input: NWUSUnicodeToUntermByte, NWUSUnicodeToUntermBytePath, NWUXUnicodeToUntermByte, and NWUXUnicodeToUntermBytePath.

These functions are identical to NWUSUnicodeToByte, NWUSUnicodeToBytePath, NWUXUnicodeToByte, and NWUXUnicodeToBytePath with one exception: the output byte string is unterminated. A trailing zero is not appended to the converted byte string.

In all other details (kinds of values that can be passed, operations performed, and numeric values returned), the two sets of functions are identical.
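
The practical consequence of the missing trailing zero can be sketched with two stand-in functions. These are hypothetical and ASCII-only, and they do not reproduce the signatures of the NWUS or NWUX functions. The point shown is that the unterminated variant needs one byte less of output space and leaves the byte after the output untouched, so the caller must keep track of the returned length.

```c
#include <stddef.h>
#include <stdint.h>

/* Shared ASCII-only conversion loop for this sketch. Returns 0 on
 * success, or -1 if a value is not ASCII or the buffer is too small. */
static int sketch_convert_ascii(char *out, size_t outSize, const uint16_t *in,
                                size_t *outLen, int terminate)
{
    size_t used = 0;

    for (; *in != 0; in++) {
        if (*in >= 0x80 || used >= outSize)
            return -1;
        out[used++] = (char)*in;
    }
    if (terminate) {
        if (used >= outSize)
            return -1;              /* no room for the trailing zero */
        out[used] = '\0';
    }
    *outLen = used;
    return 0;
}

/* Hypothetical terminated variant: appends a trailing zero. */
int sketch_unicode_to_byte_term(char *out, size_t outSize,
                                const uint16_t *in, size_t *outLen)
{
    return sketch_convert_ascii(out, outSize, in, outLen, 1);
}

/* Hypothetical unterminated variant: writes only the converted bytes. */
int sketch_unicode_to_unterm_byte(char *out, size_t outSize,
                                  const uint16_t *in, size_t *outLen)
{
    return sketch_convert_ascii(out, outSize, in, outLen, 0);
}
```

In this sketch, a three-character string fits in a three-byte field only with the unterminated variant; the terminated variant requires four bytes.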

NWU_CONVERTER_NOT_FOUND Error

If the NWU_CONVERTER_NOT_FOUND error is returned when a standard converter is being initialized or an extended converter is being loaded, the converter DLL was not found in any of the expected locations. The system looks for the converter DLL (or NLM) in the usual system DLL/NLM search path. Where the system searches depends upon the operating system under which the application operates.

For example, in 32-bit Windows applications, the search order is

  1. The directory from which the application was loaded
  2. The current directory
  3. Varies between Windows 95 and Windows NT:
    • For Windows 95, the Windows system directory
    • For Windows NT
      • The 32-bit Windows system directory
      • The 16-bit Windows system directory
  4. The Windows directory
  5. Each component in the PATH variable

For NLM applications, the search order is

  1. The NLM search paths
  2. C:\NWSERVER
  3. C:\

1.6.4 Byte/Unicode Conversions

Differences between Novell and Microsoft Unicode translation tables in different languages have sometimes caused Unicode path strings to be stored with different path separators. Novell Unicode path conversion API functions called from any language recognize these differences and correctly convert any Unicode path separator back to the local path separator character.

Extended Byte/Unicode Converter Options

Extended byte/Unicode converters can convert either from Unicode to bytes or from bytes to Unicode, depending on the function called after the converter is loaded. Variable converter options include the code page and the country code, specified in the parameters of NWUXLoadByteUnicodeConverter.

Extended Actions for Unmappable Characters

With the extended Unicode API functions, you can select any of three actions when an unmappable character is encountered:

  • Return an error
  • Convert unmappable characters into a substitution character (default or user-defined)
  • Call a handler function (default or user-defined)

For example, if a Unicode string is being converted to local code page in order to be displayed, a user-defined handler function could convert an unmappable character into a red blinking question mark. The default handler inserts the hex value of the unmappable character enclosed in square brackets in place of the character, as explained in Default Conversion Behavior.
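
The three actions can be sketched as a dispatch inside a conversion loop. Everything in this block is hypothetical: the enum values, structure, and function names are illustrative stand-ins and are not the constants or signatures defined in nunicode.h. The sketch also assumes that only 7-bit ASCII is mappable.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical action codes for this sketch only. */
typedef enum {
    SKETCH_RETURN_ERROR,
    SKETCH_SUBSTITUTE,
    SKETCH_CALL_HANDLER
} SketchNoMapAction;

/* A handler writes replacement bytes for ch into out (capacity outSize)
 * and returns the number of bytes written, or 0 on failure. */
typedef size_t (*SketchNoMapHandler)(char *out, size_t outSize, uint16_t ch);

typedef struct {
    SketchNoMapAction  action;
    char               subByte;   /* used with SKETCH_SUBSTITUTE */
    SketchNoMapHandler handler;   /* used with SKETCH_CALL_HANDLER */
} SketchConverter;

/* Mimics the documented default handler: U+NNNN becomes "[NNNN]". */
size_t sketch_default_handler(char *out, size_t outSize, uint16_t ch)
{
    char esc[7];

    if (outSize < 6)
        return 0;
    sprintf(esc, "[%04X]", (unsigned)ch);
    memcpy(out, esc, 6);
    return 6;
}

/* Returns 0 on success, -1 if the buffer is too small or the handler
 * fails, and -2 if an unmappable character is treated as an error. */
int sketch_convert_with_action(const SketchConverter *cv, char *out,
                               size_t outSize, const uint16_t *in)
{
    size_t used = 0;

    if (outSize == 0)
        return -1;
    for (; *in != 0; in++) {
        if (used + 1 >= outSize)
            return -1;                          /* keep room for the zero */
        if (*in < 0x80) {
            out[used++] = (char)*in;
        } else if (cv->action == SKETCH_SUBSTITUTE) {
            out[used++] = cv->subByte;
        } else if (cv->action == SKETCH_CALL_HANDLER && cv->handler != NULL) {
            size_t n = cv->handler(out + used, outSize - used - 1, *in);

            if (n == 0)
                return -1;
            used += n;
        } else {
            return -2;
        }
    }
    out[used] = '\0';
    return 0;
}
```

A user-defined handler with the same shape as sketch_default_handler could be placed in the structure to produce a different replacement, which is the role the text assigns to user-defined handler functions.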

Multiple Code Page Converters

Previously, an open code page had to be closed before a new code page could be opened. Using the new extended API functions, you can have multiple byte/Unicode converters loaded and active simultaneously, each with a different code page. For each converter, a handle is returned when the load function completes.

It is important to unload each converter when it is no longer needed, as explained in Unloading Converters. This helps avoid the possibility of tying up system resources needlessly.

Unmappable Characters

This option defines what action to take when an unmappable Unicode character or (less likely) an unmappable byte is encountered during conversion. The options are to

  • return an error,
  • use a substitution character,
  • or call a handler function.

These options are set individually for Unicode-to-byte conversion and byte-to-Unicode conversion. NWUXGetNoMapAction and NWUXSetNoMapAction specify both options. Refer to the function reference for details.

It is important to note that the default for Unicode-to-byte conversion is to call the handler function. That handler is described in Default Conversion Behavior.

Substitution Characters

If the NoMapAction is set to NWU_SUBSTITUTE, a substitute byte or Unicode character is output when an unmappable character is encountered. By default, NWU_SUBSTITUTE is set for byte-to-Unicode conversion and not set for Unicode-to-byte conversion, which calls the handler function instead.

The default substitution byte or Unicode character is determined by the converter, since different countries often have different preferences on what to display for undefined characters. For byte-to-Unicode conversion, the substitution character is U+FFFD, designated as the Unicode REPLACEMENT character. For Unicode-to-byte conversions, the converters generally set the default substitution byte to 0x03.

You can find out what the substitution character is through NWUXGetSubByte or NWUXGetSubUni. You can set a new substitution character through NWUXSetSubByte or NWUXSetSubUni.

Scan/Parse Action

Two scan action options are defined, one for converting Unicode-to-byte, and one for converting byte-to-Unicode. In the extended API, options are enabled or disabled through NWUXSetScanAction. By default, the scanAction is disabled for Unicode-to-byte and enabled for byte-to-Unicode.

Enabling the option causes an automatic prescan of the input string for any special sequences and calls a parse function to replace such sequences with something else in the output string.

When the scanAction option is enabled, a Scan function is called internally to scan the input string before the conversion. If it finds a special sequence, the conversion is performed up to that point and then the Parse function is called internally. The functions are never called directly by the developer. Rather, they are set as explained in NoMap, Scan, and Parse Functions.

The system supplies default scan and parse functions for both byte-to-Unicode and Unicode-to-byte conversions. The byte-to-Unicode scan/parse functions operate as described in Default Conversion Behavior: where # is the Unicode "skull and crossbones" character (U+2620), the byte input string "abc[2620]def" becomes the Unicode output string "abc#def".

By default, scan/parse action is disabled for Unicode-to-byte conversion because the need for such action is very rare. If it is enabled, it operates in a similar way to byte-to-Unicode conversion. It scans for two hexadecimal digits surrounded by square brackets in a Unicode input string and converts them into a byte character of the same hexadecimal value in the byte output string.

For related information, see Setting Scan/Parse Functions with an Extended Converter

NoMap, Scan, and Parse Functions

NWUXSetByteFunctions and NWUXSetUniFunctions set the NoMap, Scan, and Parse functions for the extended converter. The NoMap function is enabled if the NoMapAction is set to NWU_CALL_HANDLER, and the Scan and Parse functions are enabled if the ScanAction option is set to NWU_ENABLED.

The default behavior (the only behavior available for the standard converter) is to use the system-supplied UniNoMap function and the ByteScan/Parse functions as described in Default Conversion Behavior. These functions implement round-trip conversion from Unicode to byte to Unicode. The developer may replace any of these functions with custom versions.

For related information, see Setting Scan/Parse Functions with an Extended Converter

Length-Specified Byte String Conversion

The Unicode converter API provides functions for converting a specified number of bytes from a byte string into Unicode characters.

For the standard converter, the functions are NWUSLenByteToUnicode and NWUSLenByteToUnicodePath.

For extended converters, the functions are NWUXLenByteToUnicode and NWUXLenByteToUnicodePath.

These functions behave exactly like their non-"Len" counterparts (NWUSByteToUnicode, NWUSByteToUnicodePath, NWUXByteToUnicode, and NWUXByteToUnicodePath), with the following exceptions:

  • Each "Len" function allows a developer to specify the exact number of bytes to be converted.
  • Each of the "Len" functions can have an unterminated string for the input buffer.

If the length-specified function encounters a NULL before the specified number of bytes have been converted, it stops converting and returns NWU_EMBEDDED_NULL. However, it converts the bytes prior to the NULL, and returns the number of Unicode characters converted.

For example, consider the byte string abcdefgh in inbuf and the following call:

ccode = NWUSLenByteToUnicode (&outbuf, MAX_LEN, inbuf, 5, &outlen);

On return ccode is zero, outbuf contains the Unicode string abcde, and outlen contains 5.

In contrast, given the byte string abc\0defg, the NWUSLenByteToUnicode function returns NWU_EMBEDDED_NULL. On return outbuf contains the Unicode string abc and outlen contains 3.
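
The embedded-NULL behavior described above can be modeled with a small stand-in. This sketch does not call NWUSLenByteToUnicode; the function name, parameter types, and return codes are hypothetical (the real NWU_EMBEDDED_NULL value comes from nunicode.h), and only 7-bit ASCII bytes are treated as mappable.

```c
#include <stddef.h>
#include <stdint.h>

/* Placeholder return codes for this sketch only. */
#define SKETCH_OK            0
#define SKETCH_EMBEDDED_NULL 1
#define SKETCH_BUFFER_FULL   2

/* Hypothetical model of a length-specified byte-to-Unicode conversion.
 * Converts at most inLen bytes, stops early at an embedded zero byte,
 * always zero-terminates the output when outLen > 0, and reports the
 * number of Unicode values written in *actual. */
int sketch_len_byte_to_unicode(uint16_t *out, size_t outLen,
                               const unsigned char *in, size_t inLen,
                               size_t *actual)
{
    size_t i;
    int rc = SKETCH_OK;

    for (i = 0; i < inLen; i++) {
        if (in[i] == 0) {
            rc = SKETCH_EMBEDDED_NULL;      /* keep what was converted */
            break;
        }
        if (i + 1 >= outLen) {
            rc = SKETCH_BUFFER_FULL;
            break;
        }
        out[i] = (uint16_t)((in[i] < 0x80) ? in[i] : 0xFFFD);
    }
    if (outLen > 0)
        out[i] = 0;
    *actual = i;
    return rc;
}
```

Called with the input abcdefgh and a length of 5, the model returns SKETCH_OK with 5 values converted. Called with abc\0defg and a length of 8, it returns SKETCH_EMBEDDED_NULL with 3 values converted, mirroring the two cases above.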

1.6.5 Case, Collation, and Normalization Conversion

Case converter options are set with the caseFlag parameter and include the following possibilities:

Constant        Result
NWU_LOWER_CASE  Converts a Unicode string to all lower case
NWU_UPPER_CASE  Converts a Unicode string to all upper case
NWU_TITLE_CASE  Converts the first letter of each word in a Unicode string to upper case
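
The three results can be illustrated with an ASCII-only sketch. The enum and function below are hypothetical stand-ins, not the NWU_ constants or a case converter. The sketch assumes that words are separated by white space and that title case leaves the remaining letters of each word unchanged. Values above 0x7F are copied as they are.

```c
#include <ctype.h>
#include <stdint.h>

/* Hypothetical flags for this sketch only. */
typedef enum {
    SKETCH_LOWER_CASE,
    SKETCH_UPPER_CASE,
    SKETCH_TITLE_CASE
} SketchCaseFlag;

/* Converts a zero-terminated string of 16-bit values. The output buffer
 * must hold at least as many values as the input, including the trailing
 * zero. In-place use (out == in) is allowed. */
void sketch_unicode_to_case(uint16_t *out, const uint16_t *in,
                            SketchCaseFlag flag)
{
    int atWordStart = 1;

    for (; *in != 0; in++) {
        uint16_t ch = *in;

        if (ch < 0x80) {
            if (flag == SKETCH_LOWER_CASE)
                ch = (uint16_t)tolower(ch);
            else if (flag == SKETCH_UPPER_CASE || atWordStart)
                ch = (uint16_t)toupper(ch);   /* title case: first letter only */
        }
        /* The next character starts a word if this one is white space. */
        atWordStart = (*in < 0x80) && isspace(*in);
        *out++ = ch;
    }
    *out = 0;
}
```

With the input "net ware", the title-case flag in this sketch yields "Net Ware" and the upper-case flag yields "NET WARE".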

For related information, see: Converting Unicode String Case with an Extended Converter

Previous versions of the Novell Unicode API converted code page strings only into lower case Unicode strings. This limitation is now removed so that Unicode strings are no longer limited to lower case. Unicode strings can now be converted to upper case or lower case with the standard functions, and to upper, lower, or title case (first letter of each word is capitalized) with the extended functions.
