14.2 Understanding Character Set Encodings

A character set is a grouping of alphabetic, numeric, and other characters that have some relationship in common. For example, the standard ASCII character set includes letters, numbers, symbols, and control codes that make up the ASCII coding scheme. A character set encoding is the mapping of each character in a character set to a numeric value that can be understood and processed by a computer.

QuickFinder relies on character set encodings to identify the characters used when performing a search, reading a template, posting results to a Web browser, or indexing Web-based content. If the encoding information is missing in any of these areas, QuickFinder uses the default encodings identified in the SearchServlet and PrintServlet properties files. You can modify these settings by using QuickFinder Server Manager.

Because most languages have several encodings that identify their character sets, QuickFinder Server supports a wide variety of character set encodings and encoding aliases.

Some examples of character set encodings include iso-8859-1, shift_jis, big5, and latin2. The official list of registered encodings is available from the Internet Assigned Numbers Authority (IANA) (see Section 14.4, Additional Resources). These are the official names for character sets that can be used on the Internet and referred to in Internet documentation. However, not all IANA-registered character set encodings are supported by QuickFinder Server. Refer to Section 14.4, Additional Resources for a list of encodings and encoding aliases that are supported by QuickFinder Server.

14.2.1 Unicode and UTF-8

Unicode is a character encoding standard developed by the Unicode Consortium. In the 16-bit form described in this section, two bytes represent each character, which enables almost all of the written languages of the world to be represented with a single character set. In this form, no special processing is required to access any character in any language.

This makes Unicode very easy to use when processing text from multiple languages and scripts. This is the reason QuickFinder converts all external files into Unicode for processing.

As already mentioned, the 16-bit form of Unicode is two bytes wide for all characters. Although this is ideal for computer processing, it doubles the size of text in single-byte languages, which increases transfer times over the Internet. For this reason, QuickFinder also supports an alternate representation of Unicode known as UTF-8. UTF-8 is a Unicode Transformation Format that uses sequences of one to four bytes to represent all the characters in the Unicode standard (the original UTF-8 definition allowed sequences of up to six bytes). Most notably, ASCII characters are transmitted without any conversion at all, so content that contains only ASCII characters is already valid UTF-8. Many Asian languages, however, require three bytes per character in the UTF-8 format, and characters outside the 16-bit range require four.
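The size trade-off described above can be checked with a short sketch. Python is used here only for illustration and is not part of QuickFinder; the `utf-16-be` codec is assumed to stand in for the two-byte form of Unicode discussed in this section, and `byte_lengths` is a hypothetical helper name.

```python
# Compare the two-byte Unicode form with UTF-8 for a few sample characters.
# Assumption: "utf-16-be" approximates the fixed two-byte form described above.
samples = {
    "A": "ASCII letter",
    "\u00e9": "accented Latin letter",
    "\u65e5": "CJK ideograph",
}


def byte_lengths(ch):
    """Return (bytes in two-byte Unicode, bytes in UTF-8) for one character."""
    return len(ch.encode("utf-16-be")), len(ch.encode("utf-8"))


for ch, label in samples.items():
    wide, utf8 = byte_lengths(ch)
    print(f"{label}: {wide} bytes as two-byte Unicode, {utf8} bytes as UTF-8")
```

ASCII text is half the size in UTF-8, while the ideograph grows from two bytes to three, which is the trade-off to weigh when choosing between the two.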

You need to decide if Unicode or UTF-8 best meets your needs when creating HTML content, QuickFinder templates, or search pages.

14.2.2 Search Encodings

The only encodings QuickFinder currently supports when performing a search are Unicode and UTF-8. Therefore, any page that allows Web users to enter a search must ensure that the search text is passed to the server in one of these two formats. See Section 14.2.5, Template Encodings for more information.

To pass Unicode characters to QuickFinder, use the syntax %uHHHH, where

  • The percent sign (%) is used as the CGI escape character

  • The lowercase letter u indicates that the subsequent four characters represent a Unicode value

  • Four uppercase H letters (HHHH) indicate four hexadecimal characters (0-9, A-F)
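The %uHHHH syntax above could be produced with a helper along these lines. This is a minimal sketch, not QuickFinder code: `to_percent_u` is a hypothetical name, and escaping every character (including ASCII) is an assumption made for simplicity.

```python
def to_percent_u(text):
    """Escape each character as %uHHHH using four uppercase hex digits.

    Assumption: every character is escaped, including ASCII ones.
    """
    parts = []
    for ch in text:
        code = ord(ch)
        if code > 0xFFFF:
            # The %uHHHH form only has room for four hexadecimal digits.
            raise ValueError("character does not fit in four hex digits")
        parts.append(f"%u{code:04X}")
    return "".join(parts)


print(to_percent_u("\u65e5\u672c"))
```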

To pass UTF-8 characters to QuickFinder, just use normal ASCII characters or the syntax %HH... for all other characters, where

  • % is the CGI escape character

  • HH indicates two hexadecimal characters (0-9, A-F)

  • ... indicates additional %HH groupings that might be required to properly transmit a character
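The %HH form is ordinary URL percent-encoding of the UTF-8 bytes, so a standard library routine can generate it. The sketch below uses Python's `urllib.parse.quote`; the wrapper name `to_percent_utf8` is hypothetical and the example is illustrative only.

```python
from urllib.parse import quote


def to_percent_utf8(text):
    """Percent-encode text as UTF-8: plain ASCII letters and digits pass
    through unchanged, everything else becomes one or more %HH groups."""
    return quote(text, safe="", encoding="utf-8")


print(to_percent_utf8("caf\u00e9"))      # one non-ASCII character, two %HH groups
print(to_percent_utf8("\u65e5\u672c"))   # two characters, three %HH groups each
```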

HINT: If the encoding of the page containing a search form is already set to UTF-8 or Unicode, most browsers automatically transmit the entered search text correctly by using the designated encoding.

By default, QuickFinder uses UTF-8 in its sample search pages.

14.2.3 Response Encodings

One of the many parameters that can be sent when conducting a search is the encoding that should be used when returning the results to the browser. All QuickFinder encodings listed in Section B.0, Combined Character Sets for Use with QuickFinder can be used.

If the search result page contains the ability to refine or redo the search, then the response encoding can significantly impact the possible characters that can be entered when conducting the next search from this page. For example, if the user requests results in the iso-8859-1 encoding (HTML’s default), then only iso-8859-1 characters can be entered in the subsequent search from that page. Other characters can still be sent to the QuickFinder services by using the %uHHHH and %HH formats, but the browser does not allow users to enter normal text characters other than those supported by iso-8859-1.

Although QuickFinder can return search results from many languages, some characters found in titles and descriptions might be returned as question marks (?) indicating that these characters are not available in the current response encoding. If a character can be represented in the current encoding but a font is not available, many browsers substitute an alternate character such as an empty box character. After the appropriate fonts have been installed, these characters display properly.
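The question-mark substitution described above can be reproduced with any encoder that replaces unsupported characters. The following sketch only illustrates the effect and is not QuickFinder's implementation; `render_in` is a hypothetical name.

```python
def render_in(text, response_encoding):
    """Encode text for a response, substituting '?' for any character
    that the response encoding cannot represent."""
    return text.encode(response_encoding, errors="replace")


title = "Tokyo \u6771\u4eac"
print(render_in(title, "iso-8859-1"))  # the two ideographs become question marks
print(render_in(title, "utf-8"))       # every character survives
```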

By default, QuickFinder returns all search, print, and administration pages in UTF-8.

14.2.4 HTML Encodings

Because HTML content can contain text written in many character sets, all HTML files need to include a tag that identifies the character set encoding. To identify the encoding of an HTML file (or search template), use the following meta tag at the top of the file’s header section:

<meta http-equiv="Content-Type" content="text/html; charset=Shift_JIS">

In this example, you would replace Shift_JIS with the appropriate Internet Assigned Numbers Authority (IANA)-assigned encoding value.

It is very important that the CHARSET value accurately represent the character set encoding that was actually used when the HTML Web content or QuickFinder template was created. A correct entry allows QuickFinder to accurately interpret and convert the characters in the document. An incorrect entry prevents QuickFinder from being able to read the characters as valid data in the authored language.

IMPORTANT: Improperly identified characters result in garbled text. In some cases, the Web-based content cannot be properly indexed or printed. In the most severe cases, the document being read might produce a server-side exception, which ultimately discontinues processing the document and perhaps the entire operation.
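Both failure modes, garbled text and an outright exception, can be demonstrated by decoding the same bytes with the wrong encoding. This is an illustrative sketch in Python, not QuickFinder's own conversion code; the sample text is arbitrary.

```python
# Two hiragana characters, saved by the author as Shift_JIS.
original = "\u3042\u3044"
raw = original.encode("shift_jis")

# Correct declaration: the text round-trips.
correct = raw.decode("shift_jis")

# Wrong declaration (iso-8859-1): decoding succeeds, but each byte is
# treated as a separate character, so the result is garbled.
garbled = raw.decode("iso-8859-1")

# Wrong declaration (utf-8): these bytes are not valid UTF-8, so the
# decoder raises an exception instead of returning text.
try:
    raw.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

print(correct == original, len(garbled), utf8_failed)
```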

Because QuickFinder is Unicode-based, when reading templates or when indexing or printing HTML content, all character encodings are converted from their source encoding to Unicode for internal processing.

During indexing, if a document contains characters not supported by the designated encoding, if the document doesn’t have an encoding designation, or if the designation is inaccurate, the indexer attempts to recover. But if it cannot, it might index the information incorrectly or quit indexing that page entirely.

When reading a template file that contains characters not supported by the current encoding, QuickFinder tries to ignore the invalid text and continue. If this is not possible, it stops processing the file.

When displaying search results or when printing HTML content, any character that does not match the specified response encoding receives a question mark (?) in its place when rendered at the browser. Although some characters are properly supported by the current encoding, the browser might not have the required fonts to display the characters. In this case, users might see square boxes representing these characters. This is an indication that the valid character reached the browser, but the operating system could not provide a font to properly render the character. The user then needs to either change fonts or install the correct fonts in order to properly display the characters.

HINT: If a document does not contain a CHARSET encoding value, the default encoding for HTML documents is ISO-8859-1, also known as Latin1. The default encoding for plain text documents is US-ASCII.
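The fallback rule in this hint can be sketched as follows. The function name `declared_encoding`, the regular expression, and the `is_html` flag are assumptions made for illustration; QuickFinder's actual detection logic is not documented here.

```python
import re

# Matches charset=VALUE inside a Content-Type meta tag (hypothetical pattern).
META_CHARSET = re.compile(rb"charset\s*=\s*[\"']?([A-Za-z0-9_.:-]+)", re.IGNORECASE)


def declared_encoding(raw, is_html=True):
    """Return the charset a document declares, or the documented default:
    ISO-8859-1 for HTML documents and US-ASCII for plain text."""
    match = META_CHARSET.search(raw)
    if match:
        return match.group(1).decode("ascii")
    return "ISO-8859-1" if is_html else "US-ASCII"


page = b'<meta http-equiv="Content-Type" content="text/html; charset=Shift_JIS">'
print(declared_encoding(page))
print(declared_encoding(b"<html></html>"))
print(declared_encoding(b"plain text", is_html=False))
```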

QuickFinder also allows administrators to define the default encodings for templates, HTML content when printing, and search and print responses. Refer to the QuickFinder Server Manager Help for information about changing the default encodings.

14.2.5 Template Encodings

All HTML documents should include a Content-Type meta tag identifying their character set encodings. The character set encoding allows HTML Web clients (or browsers) to understand the contents of the file. This tag is also used by browsers to automatically switch their display system and fonts to correctly show the Web page’s contents. This lets users surf the World Wide Web without constantly changing their display system as they encounter content from various languages and character sets.

However, because QuickFinder lets administrators specify both template encodings and response encodings, browsers might be confused when the HTTP header carries the valid response encoding while Content-Type meta tags carried over from the original QuickFinder template specify one or more different encodings.

NOTE: $$IncludeFile[ ] templates can also contain their own Content-Type meta tags.

To solve this problem, QuickFinder allows you to place the Content-Type meta tag specifying the template’s encoding within an HTML comment. This effectively obscures the original template encoding from the browser, but still allows QuickFinder to read the encoding when the template file is processed.

A sample QuickFinder template is given below. The Content-Type meta tag has been hidden inside of an HTML comment. This template can be embedded within other templates using the $$IncludeFile[ ] template variable without affecting QuickFinder’s ability to distinguish between the various encodings. This file can also be processed and then sent to a user’s Web browser without conflicting with the response encoding provided by QuickFinder in the HTTP response headers.

<html>
<head>
<!-- Note that the HTML encoding command (meta tag) is hidden within a comment so that it does not affect a user's browser display. -->
<!-- The actual encoding used when sending this file to the user is controlled by the response encoding. -->
<!-- <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1"> -->
</head>
<body>
Template data here.
</body>
</html>
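The effect of hiding the meta tag in a comment can be sketched in Python. Here the standard `html.parser` module plays the role of a browser-like parser, and a plain text search plays the role of a template reader. How QuickFinder actually locates the tag is not documented in this section, so the regular expression is an assumption, and `MetaCollector` is a hypothetical class name.

```python
import re
from html.parser import HTMLParser

TEMPLATE = """<html>
<head>
<!-- <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1"> -->
</head>
<body>
Template data here.
</body>
</html>"""


class MetaCollector(HTMLParser):
    """Collects <meta> tags that appear as real markup (not inside comments)."""

    def __init__(self):
        super().__init__()
        self.meta_tags = []

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            self.meta_tags.append(dict(attrs))


# A browser-like parser treats the commented-out tag as a comment only.
collector = MetaCollector()
collector.feed(TEMPLATE)
visible_meta = collector.meta_tags

# A text search over the raw template still finds the declared charset.
match = re.search(r"charset\s*=\s*[\"']?([\w.:-]+)", TEMPLATE, re.IGNORECASE)
hidden_charset = match.group(1) if match else None

print(visible_meta, hidden_charset)
```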

14.2.6 Encoding Issues When Printing

When QuickFinder processes a print request, it gathers the entire contents of each file and builds an appended print job page, one file after another. Each file can contain its own Content-Type meta tag identifying its encoding. Each file’s encoding is used by QuickFinder to convert that file into Unicode before being sent out using the response encoding.
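The conversion flow described above (each file decoded with its own encoding, then the combined page encoded once for the response) might look like the following sketch. `build_print_job` and its parameters are hypothetical names, and joining the files with a newline is an assumption.

```python
def build_print_job(files, response_encoding="utf-8"):
    """Combine files into one print page.

    files is a list of (raw_bytes, source_encoding) pairs. Each file is
    decoded to Unicode with its own encoding, the pieces are appended one
    after another, and the result is encoded once in the response encoding.
    """
    pieces = [raw.decode(source_encoding) for raw, source_encoding in files]
    combined = "\n".join(pieces)
    # Characters the response encoding cannot represent become '?'.
    return combined.encode(response_encoding, errors="replace")


job = build_print_job([
    ("caf\u00e9".encode("iso-8859-1"), "iso-8859-1"),
    ("\u3042".encode("shift_jis"), "shift_jis"),
])
print(job.decode("utf-8"))
```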

Unfortunately, all of these encoding meta tags might confuse the browser’s display system. Although QuickFinder has already properly converted the files into a single response encoding, the browser sees the Content-Type meta tags that direct it to do something else, and becomes confused.

To solve this problem, you can create a print results template that contains a Content-Type meta tag encoding at both the top and bottom of the file, before and after the various documents are printed. All current browsers take either the first Content-Type meta tag that they encounter or the last. Constructing a print template with both satisfies all browsers.