Using the Advanced page of Defining a Crawled Index, you can fine-tune your new crawled index beyond giving it a name and specifying URLs.
Also, changes made on the Advanced page override the default index and site settings, letting you configure your new index independently.
Enter a unique name, such as Sales or Marketing. If you are going to create a large number of indexes, you might consider using a numbering system, such as 001, 002, 003, etc.
This optional field lets you more clearly identify an index by adding a description. This can be helpful when you have many indexes in a single search site.
This section of the page lets you define which Web sites to crawl and whether to include or exclude specific file types and subdirectories.
Enter the URL for the Web site you want included in this index. For example:
www.digitalairlines.com
or
www.digitalairlines.com/marketing
In the Subdirectories to Exclude text box, enter relative paths to the subdirectories containing information you do not want included in your index. Separate each additional path with a single space or a hard return.
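For illustration, the following minimal sketch (in Python, using hypothetical variable and function names and a single configured site for simplicity; it is not Web Search's actual code) shows how a crawler might keep a URL in scope: the URL must fall under the site you listed, and it must not fall under an excluded subdirectory.

    from urllib.parse import urlparse

    SITE = "www.digitalairlines.com/marketing"     # from Web Sites to Crawl
    EXCLUDED_SUBDIRS = ["drafts", "archive/2001"]  # hypothetical relative paths to exclude

    def in_scope(url: str) -> bool:
        parsed = urlparse(url)
        host, _, root = SITE.partition("/")
        if parsed.netloc.lower() != host:
            return False
        if not parsed.path.startswith("/" + root):
            return False
        # Excluded subdirectories are interpreted relative to the crawl root.
        return not any(parsed.path.startswith("/" + root + "/" + sub)
                       for sub in EXCLUDED_SUBDIRS)

    print(in_scope("http://www.digitalairlines.com/marketing/brochure.html"))    # True
    print(in_scope("http://www.digitalairlines.com/marketing/drafts/old.html"))  # False: excluded subdirectory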
If you want only specific file types (such as HTML, text files, or PDF documents) to be indexed while crawling Web sites, select Include. If you do not want specific file types indexed, select Exclude.
For example, if you don't want PDF documents to be indexed, you would select Exclude and then specify the extension "PDF" in the Extensions To text box.
When entering two or more extensions, separate them with a single space or a hard return.
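As an illustration of how the Include/Exclude choice and the extension list might combine, here is a minimal Python sketch with hypothetical names; it is not Web Search's actual code.

    MODE = "Exclude"                    # the Include/Exclude choice
    EXTENSIONS = {e.lower().lstrip(".") for e in "PDF".split()}   # entries from the extensions text box

    def should_index(filename: str) -> bool:
        ext = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
        matches = ext in EXTENSIONS
        return matches if MODE == "Include" else not matches

    print(should_index("pricelist.pdf"))  # False: PDF documents are excluded
    print(should_index("index.html"))     # True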
This field lets you include additional URLs when you want to index specific information located somewhere other than the locations specified in the Web Sites to Crawl fields.
This is also an effective method of avoiding indexing unwanted information. For example, if you wanted only documents from your company's marketing department to be indexed, you might specify the following URL:
http://www.marketing.com/marketing
If you leave this field blank, then the location where Web Search begins crawling is automatically determined from the URLs specified in Web Sites to Crawl.
Separate URLs with either a space or a hard return.
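The following minimal Python sketch (hypothetical names, not Web Search's implementation) shows the idea: if additional URLs are supplied, crawling is seeded from them; otherwise the starting points come from the Web Sites to Crawl entries.

    SITES_TO_CRAWL = ["www.digitalairlines.com/marketing"]   # from Web Sites to Crawl
    START_URLS = ["http://www.marketing.com/marketing"]      # this field; may be empty

    def seed_queue():
        # If additional URLs were given, crawling starts there;
        # otherwise it starts at the sites themselves.
        if START_URLS:
            return list(START_URLS)
        return ["http://" + site for site in SITES_TO_CRAWL]

    print(seed_queue())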
One of the advantages of using this Advanced page is that you can override the default location for storing indexes in this Search Site. If, for example, you know that a particular index is going to be very large, you might want to store it on an alternate volume.
The primary purpose of specifying an encoding is to assist Web browsers in correctly interpreting the characters on a page. Many Web designers include the correct encoding in an HTML Meta Tag so that Web browsers requesting their pages know how to interpret the characters on them. But many Web designers do not include the Meta Tag. This feature lets you specify a default encoding for all of the pages you index.
For example, if you create a search service for Web sites hosting predominantly Japanese content, you would want to set this to a Japanese encoding, such as SHIFT_JIS or ISO-2022-JP. That way, when customers perform a search and their browsers request a particular page, Web Search directs the browser to display the content using the Japanese encoding.
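As a rough illustration of how a default encoding could be applied, the following Python sketch (hypothetical names, not the product's implementation) checks a page for a charset declaration in a Meta Tag and falls back to the configured default when none is found.

    import re

    DEFAULT_ENCODING = "Shift_JIS"   # e.g., for a predominantly Japanese search site

    def detect_encoding(html_bytes: bytes) -> str:
        # Look for a charset declaration (e.g., in a Meta Tag) in the raw bytes.
        match = re.search(rb'charset\s*=\s*["\']?\s*([A-Za-z0-9_\-]+)', html_bytes, re.I)
        if match:
            return match.group(1).decode("ascii")
        return DEFAULT_ENCODING      # no declaration: use the index's default encoding

    page = b"<html><head><title>Price List</title></head><body>...</body></html>"
    print(detect_encoding(page))     # Shift_JIS, because no charset was declared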
Sometimes Web Search encounters very large files, such as video files or large PDF documents. To keep Web Search from getting caught indexing a large file that could tie it up for an extended period of time, you can direct it not to index files larger than the size you specify.
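As an illustration, a crawler could skip oversized files by checking the Content-Length header before downloading the full document, as in the following Python sketch (the MAX_FILE_SIZE value and names are hypothetical; this is not Web Search's actual code).

    import urllib.request

    MAX_FILE_SIZE = 5 * 1024 * 1024   # 5 MB, a hypothetical value for this setting

    def small_enough(url: str) -> bool:
        # Ask the server for headers only, then compare the reported size.
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request) as response:
            length = response.headers.get("Content-Length")
        # If the server does not report a size, this sketch optimistically allows it.
        return length is None or int(length) <= MAX_FILE_SIZE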
Similar to Maximum File Size to Index, this field also lets you control what is being indexed.
Select Yes if you want each filename to appear at the NetWare server console as it is being indexed.
Some Web servers are case sensitive and others are not. For example, UNIX servers are case sensitive, while NetWare servers are not. When crawling a case-sensitive server, Web Search needs to know this so that the dynamic links it generates on a search results page use the correct case.
If you do not direct Web Search to be aware of case sensitivity, some links might appear to be broken.
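The following Python sketch (hypothetical flag name, not Web Search's code) illustrates the difference: on a case-insensitive server, two URLs that differ only in case can be treated as the same document, while on a case-sensitive server they must be kept distinct so generated links use the correct case.

    URL_CASE_SENSITIVE = True    # e.g., a UNIX Web server

    def canonical(url: str) -> str:
        # On case-insensitive servers, case differences are not meaningful.
        return url if URL_CASE_SENSITIVE else url.lower()

    a = "http://www.digitalairlines.com/Marketing/Index.html"
    b = "http://www.digitalairlines.com/marketing/index.html"
    print(canonical(a) == canonical(b))   # False when case sensitive: both URLs are kept as-is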
Some search engines do not crawl Web pages that are created dynamically, such as pages generated from forms submitted by a Web browser. The URLs of dynamic content typically contain a question mark (?) followed by additional parameters.
This feature tells Web Search to crawl dynamically generated Web pages in addition to static pages.
NOTE: Because dynamic content can change at any time, you might want to schedule more frequent regeneration events for your indexes when enabling this feature.
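As an illustration of how dynamically generated URLs are typically recognized, the following Python sketch (hypothetical flag name, not Web Search's code) treats any URL with a query portion after a question mark (?) as dynamic and crawls it only when this feature is enabled.

    from urllib.parse import urlparse

    CRAWL_DYNAMIC = True   # stands in for this setting

    def should_crawl(url: str) -> bool:
        is_dynamic = bool(urlparse(url).query)    # anything after "?" makes it dynamic
        return (not is_dynamic) or CRAWL_DYNAMIC

    print(should_crawl("http://www.digitalairlines.com/marketing/list.html"))           # True
    print(should_crawl("http://www.digitalairlines.com/search?dept=marketing&page=2"))  # True only if CRAWL_DYNAMIC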