Novell Documentation: NetWare 6 - Understanding Character Set Encodings

Understanding Character Set Encodings

A character set is a grouping of alphabetic, numeric, and other characters that have some relationship in common. For example, the standard ASCII character set includes letters, numbers, symbols, and control codes that make up the ASCII coding scheme. A character set encoding is the mapping of each character in a character set to a value that can be understood and processed by a computer.

NetWare Web Search relies on character set encodings to identify the characters used when performing a search, reading a template, posting results to a Web browser, or indexing Web-based content. If the encoding information is missing in any of these areas, NetWare Web Search uses the default encodings identified in the SearchServlet and PrintServlet properties files. You can modify these settings using NetWare Web Search Manager.

Because the character sets of most languages can be represented by several different encodings, NetWare Web Search Server supports a wide variety of character set encodings and encoding aliases.

Some examples of character set encodings include iso-8859-1, shift_jis, big5, and latin2. The official list of registered encodings is available from the Internet Assigned Numbers Authority (see Table 17). These are the official names for character sets that can be used in the Internet and can be referred to in Internet documentation. However, not all IANA-registered character set encodings are supported by NetWare Web Search Server. Refer to Table 17 for a list of encodings and encoding aliases that are supported by NetWare Web Search Server.
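As a rough illustration of encoding names and aliases, the sketch below uses Python's standard codec registry rather than NetWare Web Search itself. The aliases that Web Search accepts are the ones listed in Table 17; the Python registry is only an analogy, and the helper name canonical_name is hypothetical.

```python
import codecs

# Encoding names taken from the examples in the text.
names = ["iso-8859-1", "shift_jis", "big5", "latin2"]

def canonical_name(name):
    """Return the name Python's codec registry uses for an encoding or alias."""
    return codecs.lookup(name).name

for name in names:
    print(f"{name} -> {canonical_name(name)}")

# Two aliases of the same encoding resolve to the same codec.
same = canonical_name("latin2") == canonical_name("iso-8859-2")
print("latin2 and iso-8859-2 are the same encoding:", same)
```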

Unicode and UTF8

Unicode is a 16-bit character encoding standard developed by the Unicode Consortium. By using two bytes to represent each character, Unicode enables almost all of the written languages of the world to be represented using a single character set. Unicode does not require any special processing to access any character in any language.

This makes Unicode very easy to use when processing text from multiple languages and scripts. This is the reason NetWare Web Search converts all external files into Unicode for processing.

As already mentioned, Unicode is two bytes wide for all characters. Although this is ideal for computer processing, it doubles the size of text written in single-byte languages, which has a significant impact on Internet performance. For this reason, NetWare Web Search also supports an alternate representation of Unicode known as UTF-8. UTF-8 is a Unicode Transformation Format that, as originally defined, uses sequences of 1 to 6 bytes to represent all the characters in the Unicode standard (the current definition, RFC 3629, limits sequences to 4 bytes). Most notably, ASCII characters are transmitted without any conversion at all. This means that most Internet content is already in the UTF-8 representation. Many Asian languages, however, require three bytes per character in the UTF-8 format. Under the original definition, other characters can require up to six bytes each.
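The size trade-off can be checked with a short sketch. It assumes that the two-byte Unicode form corresponds to UTF-16 (big-endian, no byte-order mark), which is an assumption about the representation and not a statement about Web Search internals. The sample characters are chosen only for illustration.

```python
samples = {
    "ASCII letter": "A",
    "Accented Latin letter": "é",
    "Japanese kanji": "日",
}

def byte_lengths(ch):
    """Return (bytes as 16-bit Unicode, bytes as UTF-8) for one character."""
    return len(ch.encode("utf-16-be")), len(ch.encode("utf-8"))

for label, ch in samples.items():
    wide_len, utf8_len = byte_lengths(ch)
    print(f"{label}: {wide_len} bytes as 16-bit Unicode, {utf8_len} bytes as UTF-8")
```

ASCII text shrinks by half in UTF-8, while the kanji grows from two bytes to three.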

You will have to decide whether Unicode or UTF-8 best meets your needs when creating HTML content, Web Search templates, or search pages.

Search Encodings

The only encodings NetWare Web Search currently supports when performing a search are Unicode and UTF-8. Therefore, any page that allows Web users to enter a search must ensure that the search text is passed to the server in one of these two formats. See Template Encodings for more information.

To pass Unicode characters to NetWare Web Search, use the syntax %uHHHH, where

Percent sign (%) is used as the CGI escape character

Lowercase letter u indicates that the subsequent four characters represent a Unicode value

Four uppercase H letters (HHHH) indicate four hexadecimal characters (0-9, A-F)
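The %uHHHH syntax described above can be sketched as follows. The function name to_percent_u is hypothetical, and the choice to leave ASCII letters and digits unescaped is an assumption made for readability. Characters outside the 16-bit range are emitted as two UTF-16 code units; whether Web Search accepts such pairs is not stated in this document.

```python
def to_percent_u(text):
    """Escape text as %uHHHH groups, one per 16-bit Unicode code unit."""
    out = []
    # Work on UTF-16 code units so that every value fits in four hex digits.
    data = text.encode("utf-16-be")
    for i in range(0, len(data), 2):
        unit = int.from_bytes(data[i:i + 2], "big")
        ch = chr(unit)
        if unit < 0x80 and ch.isalnum():
            out.append(ch)  # plain ASCII letter or digit, sent as-is
        else:
            out.append("%u{:04X}".format(unit))
    return "".join(out)

print(to_percent_u("日本"))  # %u65E5%u672C
```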

To pass UTF-8 characters to NetWare Web Search, send normal ASCII characters as they are and use the syntax %HH... for all other characters, where

% is the CGI escape character

HH indicates two hexadecimal characters (0-9, A-F)

. . . indicates additional %HH groupings that might be required to properly transmit a character
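The %HH form is the standard percent-encoding of UTF-8 bytes, so it can be sketched with Python's urllib.parse.quote. The wrapper name to_percent_utf8 is hypothetical. The sketch shows only how the escaped text is built, not how a request to Web Search is assembled.

```python
from urllib.parse import quote

def to_percent_utf8(text):
    """Percent-encode text as UTF-8 bytes, one %HH group per non-ASCII byte."""
    # safe="" escapes everything except ASCII letters, digits, and "_.-~".
    return quote(text, safe="", encoding="utf-8")

print(to_percent_utf8("café"))  # caf%C3%A9
print(to_percent_utf8("日本"))   # %E6%97%A5%E6%9C%AC
```

The accented letter needs two %HH groups and each kanji needs three, matching the byte counts discussed under Unicode and UTF8.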

HINT: If the encoding of the page containing a search form is already set to UTF-8 or Unicode, most browsers automatically transmit the entered search text correctly using the designated encoding.

By default, NetWare Web Search uses UTF-8 in its sample search pages.

Response Encodings

One of the many parameters that can be sent when conducting a search is the encoding that should be used when returning the results to the browser. All NetWare Web Search encodings listed in Combined Character Sets for Use with NetWare Web Search can be used.

If the search result page contains the ability to refine or redo the search, then the response encoding can significantly impact the possible characters that can be entered when conducting the next search from this page. For example, if the user requests results in the iso-8859-1 encoding (HTML's default), then only iso-8859-1 characters can be entered in the subsequent search from that page. Other characters can still be sent to the Web Search services using the %uHHHH and %HH formats, but the browser will not allow users to enter normal text characters other than those supported by iso-8859-1.

Although Web Search can return search results from many languages, some characters found in titles and descriptions might be returned as question marks (?) indicating that these characters are not available in the current response encoding. If a character can be represented in the current encoding but a font is not available, many browsers will substitute an alternate character such as an empty box character. Once the appropriate fonts have been installed, these characters will then display properly.
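The question-mark substitution can be reproduced with any encoder that replaces unsupported characters. The sketch below uses Python's errors="replace" mode as a stand-in; it is not Web Search's own conversion code, and the function name and sample title are hypothetical.

```python
def render_in(text, response_encoding):
    """Encode text for a response, substituting '?' for unsupported characters."""
    return text.encode(response_encoding, errors="replace")

title = "Café 日本"
print(render_in(title, "iso-8859-1"))  # the two kanji become question marks
print(render_in(title, "utf-8"))       # every character survives
```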

By default, NetWare Web Search returns all search, print, and administration pages in UTF-8.

HTML Encodings

Since HTML content can contain text written in many character sets, all HTML files need to include a tag that identifies the character set encoding. To identify the encoding of an HTML file (or search template), use the following META tag at the top of the file's header section:

<meta http-equiv="Content-Type" content="text/html; charset=Shift_JIS">

In this example, you would replace Shift_JIS with the appropriate Internet Assigned Numbers Authority (IANA)-assigned encoding value.

It is very important that the CHARSET value accurately represent the character set encoding that was actually used when the HTML Web content or Web Search template was created. A correct entry allows Web Search to accurately interpret and convert the characters in the document. An incorrect entry prevents Web Search from being able to read the characters as valid data in the authored language.

IMPORTANT: Improperly identified characters result in garbled text. In some cases, the Web-based content cannot be properly indexed or printed. In the most severe cases, the document being read might produce a server-side exception, which will ultimately discontinue processing the document and perhaps the entire current operation.
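Both failure modes can be seen in a small sketch that decodes Shift_JIS bytes under the wrong label. Python's decoders stand in for Web Search's converter here, so the exact behavior of Web Search might differ; the sample text is arbitrary.

```python
original = "日本語"
raw = original.encode("shift_jis")  # bytes as the author saved them

# Wrong label, permissive encoding: every byte is a valid iso-8859-1
# character, so decoding "succeeds" but the text is garbled.
garbled = raw.decode("iso-8859-1")

# Wrong label, strict encoding: the bytes are not valid UTF-8, so the
# decoder raises an exception instead of returning text.
try:
    raw.decode("utf-8")
    failed = False
except UnicodeDecodeError:
    failed = True

# Correct label: the original text is recovered.
correct = raw.decode("shift_jis")

print(repr(garbled), failed, correct)
```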

Because Web Search is Unicode-based, when reading templates or when indexing or printing HTML content, all character encodings are converted from their source encoding to Unicode for internal processing.

During indexing, if a document contains characters not supported by the designated encoding, if the document doesn't have an encoding designation, or if the designation is inaccurate, the indexer will do its best to recover. But if it cannot, it might index the information incorrectly or quit indexing that page entirely.

When reading a template file, Web Search might automatically cease processing the file if it contains any characters not supported by the current encoding. It will try to ignore the invalid text and continue, but this might not be possible.

When displaying search results or when printing HTML content, any character that does not match the specified response encoding will receive a question mark (?) in its place when rendered at the browser. Although some characters are properly supported by the current encoding, the browser might not have the required fonts to display the characters. In this case, users might see square boxes representing these characters. This is an indication that the valid character reached the browser, but the operating system could not provide a font to properly render the character. The user would then have to either change fonts or install the correct fonts in order to properly display the characters.

HINT: If a document does not contain a CHARSET encoding value, the default encoding for HTML documents is ISO-8859-1, also known as Latin1. The default encoding for plain text documents is US-ASCII.
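The fallback rule for HTML documents can be sketched as a lookup with a default. The regular expression is a simplification that does not parse HTML properly, and the function name declared_charset is hypothetical; Web Search's actual detection logic is not described in this document.

```python
import re

# Matches charset=VALUE as it appears in a Content-Type META tag.
CHARSET_RE = re.compile(rb"charset\s*=\s*[\"']?([A-Za-z0-9._:-]+)", re.IGNORECASE)

def declared_charset(html_bytes, default="iso-8859-1"):
    """Return the declared charset, or the HTML default when none is present."""
    match = CHARSET_RE.search(html_bytes)
    return match.group(1).decode("ascii") if match else default

tagged = b'<meta http-equiv="Content-Type" content="text/html; charset=Shift_JIS">'
print(declared_charset(tagged))            # Shift_JIS
print(declared_charset(b"<html></html>"))  # iso-8859-1
```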

Web Search also allows administrators to define the default encodings for templates, HTML content when printing, and search and print responses. Refer to the NetWare Web Manager Help for information about changing the default encodings.

Template Encodings

All HTML documents should include a Content-Type META tag identifying their character set encodings. The character set encoding allows HTML Web clients (or browsers) to understand the contents of the file. This tag is also used by browsers to automatically switch their display system and fonts to correctly show the Web page's contents. This lets users surf the World Wide Web without having to constantly change their display system as they encounter content from various languages and character sets.

However, because NetWare Web Search lets administrators specify both template encodings and response encodings, browsers might get confused when presented with the valid response encoding in the HTTP header and one or more alternate encodings from the Content-Type META tags within the file that was part of the original Web Search template.

NOTE: $$IncludeFile[ ] templates can also contain their own Content-Type meta tags.

To solve this problem, NetWare Web Search allows placing the Content-Type META tag specifying the template's encoding within an HTML comment. This effectively obscures the original template encoding from the browser, but still allows Web Search to read the encoding when the template file is processed.

A sample Web Search template is illustrated below. The Content-Type META tag has been hidden inside of an HTML comment. This template can be embedded within other templates using the $$IncludeFile[ ] template variable without affecting Web Search's ability to distinguish between the various encodings. This file can also be processed and then sent to a user's Web browser without conflicting with the response encoding provided by Web Search in the HTTP response headers.

<html> <head></head> <body>

Template data here.</body> </html>
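The effect of hiding the tag in a comment can be sketched as follows. The template text, the Shift_JIS value, and the names MetaCollector and template_charset are all hypothetical. Python's HTMLParser stands in for a browser, and a plain text search stands in for the way Web Search reads the template.

```python
import re
from html.parser import HTMLParser

# Hypothetical template with the Content-Type META tag inside an HTML comment.
TEMPLATE = """<html> <head>
<!-- <meta http-equiv="Content-Type" content="text/html; charset=Shift_JIS"> -->
</head> <body>
Template data here.</body> </html>"""

class MetaCollector(HTMLParser):
    """Collects META tags the way an HTML parser sees them."""
    def __init__(self):
        super().__init__()
        self.meta_tags = []

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            self.meta_tags.append(dict(attrs))

def template_charset(template):
    """Read charset= from the raw text, including text inside comments."""
    match = re.search(r"charset\s*=\s*([A-Za-z0-9._:-]+)", template, re.IGNORECASE)
    return match.group(1) if match else None

collector = MetaCollector()
collector.feed(TEMPLATE)
collector.close()

print(collector.meta_tags)         # [] because the tag is inside a comment
print(template_charset(TEMPLATE))  # Shift_JIS
```

The parser reports no META tag, so nothing competes with the response encoding in the HTTP header, while the raw-text search still finds the template's encoding.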

Encoding Issues When Printing

When NetWare Web Search processes a print request, it gathers the entire contents of each file and builds an appended print job page, one file after another. Each file can contain its own Content-Type META tag identifying its encoding. Each file's encoding will be used by Web Search to convert that file into Unicode before being sent out using the response encoding.
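The per-file conversion described above amounts to decoding each file with its own encoding and re-encoding everything with one response encoding. This is a minimal sketch under that reading; the function name and sample files are hypothetical, and Python strings play the role of the internal Unicode form.

```python
def to_response_bytes(raw, source_encoding, response_encoding="utf-8"):
    """Decode one file with its own encoding, then encode it for the response."""
    text = raw.decode(source_encoding)  # source encoding -> Unicode
    return text.encode(response_encoding, errors="replace")

# Two hypothetical files saved in different encodings.
files = [
    ("日本".encode("shift_jis"), "shift_jis"),
    ("café".encode("iso-8859-1"), "iso-8859-1"),
]

# Append the converted files one after another, as in a print job page.
page = b"".join(to_response_bytes(raw, enc) for raw, enc in files)
print(page.decode("utf-8"))
```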

Unfortunately, all of these encoding META tags might confuse the browser's display system. While Web Search has already properly converted the files into a single response encoding, the browser sees the Content-Type META tags which direct it to do something else, and gets confused.

The way to solve this problem is to create a print results template that contains a Content-Type META tag encoding at both the top and bottom of the file, before and after the various documents get printed. All current browsers take either the first Content-Type META tag that they encounter or the last. Constructing a print template with both satisfies all browsers.
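A print results page with the tag at both ends might be assembled along these lines. The function name, the page layout, and the UTF-8 default are assumptions for illustration; the actual template variables used by Web Search print templates are not shown in this section.

```python
META = '<meta http-equiv="Content-Type" content="text/html; charset={charset}">'

def build_print_page(documents, response_charset="UTF-8"):
    """Wrap already-converted documents with the same META tag at top and bottom."""
    tag = META.format(charset=response_charset)
    parts = ["<html> <head>", tag, "</head> <body>"]
    parts.extend(documents)          # each document is already in the response encoding
    parts.append(tag)                # repeated after the last printed document
    parts.append("</body> </html>")
    return "\n".join(parts)

page = build_print_page(["<p>First document</p>", "<p>Second document</p>"])
print(page)
```

A browser that honors the first tag and a browser that honors the last tag both end up with the response encoding.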