Tips and Tricks Using QuickFinder Server 4.0!
Novell Cool Solutions: Feature
By Peter Clemons
Posted: 1 Dec 2005
QuickFinder is one of Novell's little-known gems. And yet, you've probably used it many times since it's the search engine used, without alteration, on Novell.com and the award-winning Novell Support Web sites. The best part is that you probably already own it ...QuickFinder Server ships with Open Enterprise Server on both NetWare and Linux platforms and has been shipping with NetWare since version 5.1!
This article discusses some of the more significant things you can do to solve the common problems that arise when implementing a search on your corporate Web site and, of course, to improve your users' search success. Many of the tips and tricks discussed in this article are not available in any other QuickFinder-related resource, so you might want to bookmark this page and refer to it often.
To learn about how Novell has implemented QuickFinder on its own Web sites, see the companion article entitled Lost and Found - The Search Engine You Probably Already Own in the November/December issue of Novell Connection magazine.
Table of Contents
- Adding a QuickFinder search box to the header of your corporate Web site
- Search results Titles & Descriptions are bad
- Spotlighting your "Best" documents
- Upgrading from Web Search
- Endless indexing
- Some files aren't getting indexed
- Case-sensitivity on Linux
- Crawler sits a long time on a single URL, then fails after nine minutes
- How can I cancel an index that's generating without losing everything?
- Indexing remote file servers
- Failed indexes
- Synchronizing indexes takes a very long time
- Rights-based searching
- Dynamic index weights
- XML Search Reports
- Getting Help
- Just for fun
One of the first things search administrators want to do is add QuickFinder's search capabilities to the header section of their corporate Web site. To do so, just add the following HTML which will create both the search box to the left where users can enter their search criteria and the drop-down on the right which lists the available indexes to search in. The sample HTML actually creates a "search" button, but we opted for a search image on our internal Web site instead.
<form action="/qfsearch/search" method="get">
  <input type="text" name="query" size="30">
  <select name="index">
    <option value="IndexName1">Index Name 1</option>
    <option value="IndexName2">Index Name 2</option>
  </select>
  <input type="submit" value="Search">
</form>
Alternatively, instead of hard-coding the indexes to display in the dropdown box, you can dynamically display all of the indexes defined for the current Virtual Search Server by placing the HTML defined above into a QuickFinder Search Page template, using the template variables $$Begin/EndServerIndexesLoop and $$ServerIndexName around the <option> lines to display all of the indexes, and bringing the result in as a server-side include using the following URL:
Of course, with QuickFinder's almost 50 query parameters, there's a lot more you can do with this form. Some of the more significant query parameters you might want to consider include:
- server: the name of the Virtual Search Server to search in. In the absence of the server parameter, QuickFinder uses the www.yourdomain.com portion of the URL to determine the target VSS.
<input type="hidden" name="server" value="Your VSS Name Here">
- bbindex: the name of the Best Bets indexes to use for this particular query (allows you to use different Best Bets indexes from different parts of your Web site, a very useful thing to do). See Improving Search Results in the on-line documentation for more information.
<input type="hidden" name="bbindex" value="BBIndexName1;BBIndexName2">
- expandindex: the name of the indexes to search in if the primary indexes produce a "Not Found" condition.
<input type="hidden" name="expandindex" value="ExpandIndexName1;ExpandIndexName2">
- filefilter: performs the search within the name of the file rather than the standard full-content search. This query parameter is generally used in combination with the &query= parameter to "filter out" those hits that otherwise matched the user's query, but didn't match the specified filename extension(s). The &filefilter= query parameter can also be sent completely by itself; in this case, QuickFinder just does a filename search. For more information, see the Cool Solutions article entitled Finding a File Quickly Using NetWare Web Search that discusses this feature.
<input type="text" name="filefilter" value="user's filename search criteria">
- filter: allows administrators to send additional query details not specified by the user to help limit the scope of the search. The value of this parameter is not seen by the user. For more information, see the Using the &filter Query Parameter topic within the QuickFinder documentation.
<input type="hidden" name="filter" value="additional search details">
- template: the name of the template to use to display the search results. This is often used in conjunction with the &theme= query parameter. To learn more about templates, see Understanding Templates in the QuickFinder documentation.
<input type="hidden" name="template" value="name_of_search_results_template.html">
<input type="hidden" name="theme" value="ThemeDirectoryName">
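The same parameters can also be assembled into a search URL from a script, for example when you generate links rather than forms. Here is a minimal sketch. It assumes the /qfsearch/search path from the form above; the host www.example.com and the helper name build_search_url are placeholders of mine, not part of QuickFinder.

```python
from urllib.parse import urlencode


def build_search_url(base, query, indexes, **extra):
    """Return a QuickFinder-style search URL (illustrative helper, not a QuickFinder API)."""
    params = [("query", query)]
    # One index= pair per index, mirroring the <select name="index"> form field.
    params += [("index", name) for name in indexes]
    # Optional parameters such as server, bbindex, template, or theme.
    params += sorted(extra.items())
    return base + "?" + urlencode(params)


url = build_search_url(
    "http://www.example.com/qfsearch/search",  # placeholder host
    "printing",
    ["IndexName1"],
    bbindex="BBIndexName1;BBIndexName2",
    theme="ThemeDirectoryName",
)
print(url)
```

Note that urlencode percent-encodes the semicolon between the Best Bets index names, which is the same thing a browser does when it submits the hidden form field.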
I see this almost every time I generate an index of a new Web site. Many of the documents on the site have missing or useless titles. Missing titles occur when your documents (Word, WordPerfect, PDF, HTML) simply don't have a Title field defined. It's a common, but critical mistake. Go back to your content providers and have them create useful, meaningful titles in all your corporate documents. In the absence of a title field, QuickFinder automatically uses the URL of the document as the title.
Useless titles often occur when indexing content that's been dynamically generated by a server-side application. For example, a forums server might create an HTML <TITLE> tag containing "Forum Message" as the title of every file it produces. Clearly, this is not very useful for customers looking for a particular message.
Another common problem that results in poor titles is a well-meaning web site owner who creates a template HTML document that all employees throughout the organization should use. However, rather than empty Title and Description tags, he creates the template with the requisite tags in place ...including a few instructions right within the tag on how to use them. "Put your title here" and "For this tag, write a good description telling the purpose of this document" end up being the Title and Description of every document on the web site; users of the template rarely replace the instructions with the actual values. Whatever you do with your HTML files, you should be careful to never create a situation where all documents have the same, redundant information in a field. It's much better to not have a field, than to have a useless one. Search engines can usually get around an absent tag, but they can't get around a valid tag that contains silly data.
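Before you generate an index, it can help to audit a sample of pages for missing or repeated titles. The following is only a sketch of that idea using Python's standard HTML parser. The page contents and URLs are made up for illustration, and it has nothing to do with how QuickFinder itself extracts titles.

```python
from collections import Counter
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element of a page."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.done = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def audit_titles(pages):
    """pages maps a URL to its HTML; returns (URLs without a title, titles used more than once)."""
    titles = {}
    for url, html in pages.items():
        parser = TitleParser()
        parser.feed(html)
        titles[url] = parser.title.strip()
    missing = sorted(url for url, title in titles.items() if not title)
    counts = Counter(title for title in titles.values() if title)
    repeated = {title: n for title, n in counts.items() if n > 1}
    return missing, repeated


sample_pages = {  # hypothetical pages
    "/forum/1.html": "<html><head><TITLE>Forum Message</TITLE></head></html>",
    "/forum/2.html": "<html><head><title>Forum Message</title></head></html>",
    "/docs/a.html": "<html><head></head><body>No title here</body></html>",
}
missing, repeated = audit_titles(sample_pages)
print(missing)
print(repeated)
```

A title that shows up on many pages, such as "Forum Message" or leftover template instructions, is a good candidate to send back to your content providers.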
Ensuring your best documents are at the top of the search results list is clearly one of the most important things you can do to provide a useful search capability to your customers.
Of course, the first rule of thumb should be that your documents actually talk about whatever the user is searching for. For example, if a user searches for a particular product, you'll want to be absolutely certain that the product's Web pages actually contain the name of the product ...and in prominent locations such as the title, keywords, description, and headings. For example, on Novell.com, the various product "home" pages consist primarily of flashy graphics and a ton of links. A quick review of QF's Relevance Algorithm exposes two problems with these pages:
- All the text within links (which represents almost 100% of the text on Novell's product pages) is of little relevance since a link indicates that the "other" document talks about the topic rather than the current document.
- Neither QuickFinder, nor any other search engine at this time, can extract text from bitmap images.
The net result is that these pages are irrelevant and would never show up prominently in any search result list!
QuickFinder provides a host of features to ensure your best documents get the appropriate visibility they deserve. It's important that you learn how these features work. Used together, they can dramatically improve your search results. Some of the more significant features include artificial relevance adjustments, Best Bets, Synonyms, keyword Redirection, "Not Found" Search Expansion, Stop Words, Speller suggestions, Show First Hit, weighted queries, and making your documents themselves more relevant. To learn more about these features, read the Beyond Best Bets sidebar which is part of the article entitled Lost and Found - The Search Engine You Probably Already Own in the November/December issue of Novell Connection magazine. Also, refer to the extensive Optimizing Search Results section of the QuickFinder Server documentation.
Ultimately, Novell improved the relevance of its product pages by making content changes to these pages (titles, headings, etc.) and artificially boosting product pages within the index definition. These pages also inspired our most recent addition to QuickFinder's already very good relevance algorithm ...the more recent QuickFinder releases now include "depth" information (e.g., where the file is located within the Web site / file system) to help calculate relevance. Clearly, the "higher up" a page is on a Web site, the more relevant it should be.
On NetWare, QuickFinder will automatically copy all of the configuration settings from Web Search to the new QuickFinder directory (SYS:\qfsearch). Indexes that are under the SYS:\NSearch directory will automatically have new directories created under the SYS:\qfsearch hierarchy. Indexes that were defined on another volume will simply remain where they are. However, the older Web Search indexes will not be able to take full advantage of the newer QuickFinder features, so you'll have to regenerate those indexes before a user can perform searches. QuickFinder doesn't automatically regenerate the indexes because indexing can take a lot of horsepower and sometimes days to complete (if you have millions of files). That task is better left for you to decide when it should happen. Moreover, some products require a reboot after an upgrade which would cause problems for a currently running regeneration. The need to regenerate your indexes is mentioned in the upgrade dialogs.
Note: Once you've upgraded, you should modify the search forms throughout your Web site to point to the new /qfsearch invoker URL instead of the older /NSearch version. While the new QuickFinder continues to recognize the older invokers, searches will be faster if you update all your search forms.
Sometimes, QuickFinder's indexer will get caught indexing some Web site forever. This usually happens when crawling some type of dynamic content ...a calendar program, for example, that shows each day up to the year 20,000,000. Or an application server that allows multiple URLs to point to roughly the same content. QuickFinder's "fingerprint" functionality already detects a wide variety of duplicate situations, but sometimes the application slightly modifies the contents of the file (for example, the URLs on the page or the current date/time stamp) which slips past QuickFinder's ability to detect the file as a duplicate. The last situation where we've seen endless indexing is in some type of enormous URL or symbolic link loop. In any of these situations, you'll want to use an exclude command of some type to prevent QuickFinder from following the unwanted link path. To exclude files, you could use the Robots.txt Exclusion standard, a Robots meta tag, QuickFinder's unique <!--*Robots NoIndex> comment tags, a maximum "depth" setting, or the path and filename exclusion filters available on the index definition pages.
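If you go the robots.txt route, it's worth checking that your rules really block the runaway path before you start the next crawl. Here is a small sketch using Python's standard robots.txt parser. The /calendar/ path is a made-up example, and I use the wildcard user agent since the name QuickFinder's crawler announces isn't covered here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that keeps every crawler out of an endless calendar.
robots_lines = [
    "User-agent: *",
    "Disallow: /calendar/",
]

parser = RobotFileParser()
parser.parse(robots_lines)


def allowed(path):
    """True when a crawler honoring robots.txt may fetch the given path."""
    return parser.can_fetch("*", path)


print(allowed("/products/index.html"))
print(allowed("/calendar/2005/12/"))
```

The first path is allowed and the second is blocked, which is what you want to see before you let the crawler loose again.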
Sometimes, QuickFinder doesn't index something that you think it should. There are many reasons why this might occur, so it's hard sometimes to figure out why it's happening; but it usually occurs because of one of the following reasons:
- Exclusions: You've somehow told QuickFinder to not index the files (see the links mentioned in "Endless Indexing" above). It's usually a Robots exclusion, a path or filename exclusion, or you've failed to include the appropriate "starting" URL in the index definition.
- No Link Path: You don't have a valid link-path to the missing files from the starting URLs you've listed on the index definition pages. It sounds odd, but most Web sites have tons of files that don't have any links pointing at them.
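To find files that have no link path, you can walk the link graph from your starting URLs and compare what you reach with the list of files you expected. This is a rough sketch of that check; the link graph below is invented, and in practice you'd build it from your own site data.

```python
from collections import deque


def reachable(start_urls, links):
    """Breadth-first walk of a link graph; returns the set of URLs a crawler could reach.

    links maps each URL to the list of URLs it links to.
    """
    seen = set(start_urls)
    queue = deque(start_urls)
    while queue:
        url = queue.popleft()
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


site = {  # hypothetical site
    "/index.html": ["/a.html"],
    "/a.html": ["/b.html"],
    "/b.html": [],
    "/orphan.html": ["/a.html"],
}
orphans = sorted(set(site) - reachable(["/index.html"], site))
print(orphans)
```

Here /orphan.html links out to other pages, but nothing links to it, so a crawler starting at /index.html never finds it.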
The suggestion to create a special-purpose indexing page is useful in many situations. Generally, this is a page that's created strictly to allow search engines to find the information on your dynamic, or otherwise hard-to-reach, Web sites. It's generally not intended for end-users to interact with this file, and is therefore often a "hidden" part of your Web site. In most cases, you'll want to dynamically generate these pages as the search engine requests them. You should also use a Robots meta tag with the values NoIndex and Follow to prevent these "hidden" pages from showing up in your search results and a "Maximum index depth" setting (see Endless indexing above) of 2 so the crawler only fetches the URLs listed and doesn't go any further.
<meta name="robots" content="noindex,follow">
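Generating such a page dynamically can be as simple as writing out one link per document. The sketch below shows the idea; the function name, page title, and URLs are my own placeholders, and the only piece taken from this article is the Robots meta tag.

```python
from html import escape


def build_crawler_page(urls):
    """Return an HTML page that links to every URL but asks not to be indexed itself."""
    links = "\n".join(
        f'<a href="{escape(url, quote=True)}">{escape(url)}</a><br>' for url in urls
    )
    return (
        "<html><head>\n"
        '<meta name="robots" content="noindex,follow">\n'
        "<title>Crawler entry page</title>\n"
        "</head><body>\n" + links + "\n</body></html>"
    )


page = build_crawler_page(["/forum/msg?id=1&view=full", "/forum/msg?id=2&view=full"])
print(page)
```

Escaping the URLs matters here because dynamic sites tend to produce query strings full of ampersands.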
DOS, Windows, and NetWare all consider differences of character case in filenames as the same file. Therefore, by default, when QuickFinder's indexer encounters the files FILE1.DOC and file1.doc, it considers these as the same file. However, on Linux and sometimes the Internet itself, these are considered separate files. It's up to you to tell QuickFinder how it should treat case sensitivity as it indexes a set of files on either your file server or Web site.
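If you're not sure which setting you need, check whether your content contains names that differ only by case. This is a quick sketch of that check over a list of paths; the file names are examples only.

```python
from collections import defaultdict


def case_collisions(paths):
    """Group paths that become identical once character case is ignored."""
    groups = defaultdict(list)
    for path in paths:
        groups[path.casefold()].append(path)
    return [sorted(names) for names in groups.values() if len(names) > 1]


collisions = case_collisions(["docs/FILE1.DOC", "docs/file1.doc", "docs/readme.txt"])
print(collisions)
```

An empty result suggests the case-insensitive treatment is safe for that content. Any groups it reports are separate files on Linux that would be treated as one file if case is ignored.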
If a server goes down that contains a URL that QuickFinder on Linux is trying to access, it might try for up to three minutes, then try again two more times for a total of nine minutes per failed URL. This situation can turn especially sour if a search administrator allows one or two levels of off-site URLs since administrators rarely know where all of the off-site links point and since many of these are often dead links.
Use the Linux tcp_syn_retries kernel setting (net.ipv4.tcp_syn_retries) to control the timeout value when connecting to a remote site. This sets the maximum number of times the initial SYN is retransmitted for an active TCP connection attempt. The default value is 5, which corresponds to approximately 180 seconds; the value should never be higher than 255.
Did you catch that? That's three minutes per attempt. And because QuickFinder automatically retries failed URLs on its own, it's best to set this value to 1, which corresponds to about 15 seconds per attempt. Future releases of QuickFinder on Linux will automatically set this value to 1.
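The arithmetic is worth spelling out. The sketch below multiplies the per-attempt figures quoted above by the three attempts the crawler makes, and also reads the current kernel value where the standard /proc file is available. The function names are mine.

```python
from pathlib import Path

SYN_RETRIES_FILE = Path("/proc/sys/net/ipv4/tcp_syn_retries")


def worst_case_seconds(per_attempt_seconds, crawler_attempts=3):
    """Total wait for one dead URL when every attempt runs into the full connect timeout."""
    return per_attempt_seconds * crawler_attempts


def current_syn_retries(path=SYN_RETRIES_FILE):
    """Return the kernel's current setting, or None when the file is unavailable."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


print(worst_case_seconds(180))  # default setting: 540 seconds, or nine minutes
print(worst_case_seconds(15))   # setting of 1: 45 seconds
print(current_syn_retries())
```

On most Linux systems you can change the value as root with sysctl -w net.ipv4.tcp_syn_retries=1. Check your distribution's documentation for how to make the change survive a reboot.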
It's happened many times. You define an index, then begin generating. While watching the progress of the index generation on the Active Index page, you realize you forgot to add something to the index definition. However, if you cancel the job now, you'll lose all of the valuable information in the indexing log file and you won't be able to do any test searches to see how your documents are showing up in the search results.
To resolve these issues, QuickFinder has added a new Stop feature in addition to the Cancel feature on the View Active Jobs page. Unlike Cancel, Stop simply quits finding new files and makes indexes of what it has collected so far. This lets you look at the indexing logs to see what's happening or what's gone wrong without losing all of the indexed information thus far. A perfect time to use this great feature is when you're defining new indexes.
The new "Stop" button tells QuickFinder to generate the index without fetching additional files.
Besides Web sites, QuickFinder is also able to index both local and remote file systems ...as long as the remote volumes can be mounted locally.
The best way to do this on NetWare is to install Novell's NFS Gateway on the box that hosts the QuickFinder Search Server. Any remote volumes you wish to index should be exported as NFS volumes, then mounted on the QuickFinder box using the NFS Gateway product.
On SUSE Linux, mounting almost any type of volume is very easy. For Novell's NSS volumes, make sure the Tomcat user (novlwww) is a member of the www group and that both the www group and the novlwww user are LUM-enabled. This should have been done during the initial install. Once LUM-enabled, you'll need to give the novlwww user or the www group rights to read the NSS volume. To do this, use the rights program to grant trustee rights to the volume.
rights -rwf trustee novlwww (you might have to provide a full context)
Although QuickFinder has many error-recovery features built into both the indexer and the crawler, the fact is, sometimes index generation fails. There can be a number of reasons for this such as external servers going down, network failures, etc. Unfortunately, customers have largely been unable to determine the cause of the failure since the "View Log" button on the "Indexes Maintenance" page displays only the most recent successful log. To remedy this problem, the OES version of QuickFinder has added a new "View Log" button on the "Active Index" page for a particular running job. Unlike the "View Log" button on the "Indexes Maintenance" page (which shows the crawled.log file), this "View Log" button shows the failed.log file ...something customers have never even known about in the past.
Clicking on the "View Log File" button displays the failed.log file.
QuickFinder provides a built-in ability to maintain a Virtual Search Server configuration across multiple QuickFinder servers. This is a very useful feature when you need more than one search server to handle high load situations. For example, on Novell.com, due to the large number of searches performed at peak times, we have configured a set of 4 QuickFinder OES Linux boxes to host the service. Each has the same Virtual Search Server configured in the same way with all of the same indexes, templates, etc. Two separate QuickFinder Server boxes have been set up to do nothing but continually regenerate the indexes. The speed problem occurs when synchronizing these indexes to the search servers.
Any time you copy huge files from one machine to another, it's going to take a long time. Because QuickFinder uses the HTTP PUT method for security reasons to synchronize the indexes, it takes even longer. However, the part of the communications pipeline that takes the longest is the use of the HTTPS (SSL) protocol. We have found that indexes synchronize 10 times faster between machines if admins use the unprotected HTTP protocol, which can be configured on the Global Synchronization Settings page. Besides, if both the sending and receiving machines are behind the corporate firewall, which they usually are, then there's really no need for the added security.
Disabling the use of HTTPS can speed the synchronization process tenfold.
To learn more about how Novell has implemented QuickFinder on its own Web sites, see the companion article entitled Lost and Found - The Search Engine You Probably Already Own in the November/December issue of Novell Connection magazine.
QuickFinder and Web Search have always been able to perform rights-based searching, which shows only those hits that users have rights to see. However, if a particular user doesn't have rights to the first 100,000 files, then it can take a long time to show any search results at all since QuickFinder has to ask both eDirectory and the file system if the user has rights to see each search result ...that's 100,000 file system and eDirectory calls before we can show even 1 file.
To mitigate the speed issues, both QuickFinder and Web Search have the ability to define rights at several levels of access control.
These settings are available from the Index Definition pages.
- Individual-file level / Defining access rights at this level is very secure, but relatively slow. For most situations, checking rights (and skipping) up to about 10,000 files doesn't slow the system down too much. Set the "Authorization Checking" to "by Result Item" and the "Check authorization by directory:" to No.
- Path level / If a user has rights to the first search result in a particular directory, then they have rights to all the files in that directory. The reverse of this is also true ...if the first file restricts access, then all the files are likewise restricted. This level of access control is still sufficiently secure, but much faster than Individual File-based rights. Set the "Authorization Checking" to "by Result Item" and the "Check authorization by directory:" to Yes. Note that this is the default setting.
- Index level / If a user has rights to the index control file, then they have rights to see all the search results from that index. In other words, we perform only 1 rights check (to the index file) for each query. This level provides mild security, but is very fast. Under high-load conditions, this is the best choice, but requires you to segregate your content into public and protected indexes. To enable this level of access control, set the "Authorization Checking" to "by Index". Note that you never ever want a user to have access to your indexes; therefore, QuickFinder uses an alternate, public file (which the admin specifies; see illustration) to evaluate users' rights to the protected index.
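To see why path-level checking is so much faster than checking every file, consider this sketch. It remembers the answer for the first hit in each directory and reuses it for the rest. This only illustrates the idea and is not QuickFinder's implementation; has_rights stands in for the expensive eDirectory and file system lookup.

```python
import posixpath


def filter_by_directory(hits, has_rights):
    """Keep the hits a user may see, checking rights only once per directory."""
    decisions = {}  # directory -> result of the first rights check made there
    visible = []
    for path in hits:
        directory = posixpath.dirname(path)
        if directory not in decisions:
            decisions[directory] = has_rights(path)
        if decisions[directory]:
            visible.append(path)
    return visible


checked = []  # records which files triggered a real rights check


def has_rights(path):
    """Hypothetical rights check: this user may only read files under /pub."""
    checked.append(path)
    return path.startswith("/pub/")


hits = ["/hr/a.doc", "/hr/b.doc", "/pub/c.doc", "/pub/d.doc"]
visible = filter_by_directory(hits, has_rights)
print(visible)
print(checked)
```

Four hits cost only two rights checks here. With thousands of hits spread over a handful of directories, the savings are far larger.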
Using QuickFinder, you can now specify an index weight value on a query-by-query basis which overrides the index weight value you defined as a default when configuring the index. The new syntax is:
In other words, the &index= parameter can be sent multiple times. Each occurrence can specify multiple indexes and each index can include an optional weight value (:###). Weight values can range from 1 to 200.
Normal index query parameter: &index=QuickFinder+Server&index=DocRoot
Boosted index query parameter: &index=QuickFinder+Server:125&index=DocRoot
Note that the hits from the QuickFinder Server index have become more relevant.
Don't overlook the importance of the new weight value. You can now choose to place emphasis on a particular index by boosting its index weight value without excluding other indexes from the search as you've had to in the past. Yet on a different part of the Web site, you may choose to boost another index from the list, while always searching the same indexes. This way, you're no longer forced to eliminate other results by having users select the indexes to search in. They can now simply emphasize particular results as the need arises. You can use the new index weight values with the following parameters: &index=
To gain maximum advantage from this tip, I'd suggest you create indexes for each major section of your site such as products, services, support, company, etc. Then, when users perform a search from one of these areas of your Web site, slightly boost the weight of the corresponding index (a value of 125 should do the trick nicely).
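If you build your search links in a script, the weighted &index= parameters can be generated along these lines. This is a sketch; the helper name is mine, and the index names are the ones from the example above.

```python
from urllib.parse import urlencode


def index_params(indexes):
    """Build repeated index= parameters from (name, weight) pairs.

    A weight of None leaves the index at its configured default weight.
    """
    pairs = []
    for name, weight in indexes:
        if weight is None:
            pairs.append(("index", name))
        elif 1 <= weight <= 200:
            pairs.append(("index", f"{name}:{weight}"))
        else:
            raise ValueError(f"weight {weight} is outside the range 1 to 200")
    # safe=":" keeps the colon literal, matching the examples in this article.
    return urlencode(pairs, safe=":")


boosted = index_params([("QuickFinder Server", 125), ("DocRoot", None)])
print(boosted)  # index=QuickFinder+Server:125&index=DocRoot
```

Rejecting out-of-range weights early keeps a typo in one section's search form from silently producing odd result rankings.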
QuickFinder defaults to using the ReportTemplate.html file when generating weekly or monthly query reports. These reports provide a great deal of detail about user searches, search traffic patterns, errors, templates, etc. However, QuickFinder's raw query logs actually contain quite a bit of additional information that's not exposed in the basic query reports. Because the raw query logs are difficult to interpret, QuickFinder provides an alternative query reporting template to export the raw query details into XML format.
To generate an XML report, simply change the report template name to ExportTemplate.xml and click the "Generate Current" button. The resultant XML query log can then be retrieved into more sophisticated reporting and log-analyzing products.
Sample XML export of raw Query Report log file
Note: You can temporarily change the report template, highlight a desired date range from the Available Log Reports list box, then click Generate Current without clicking on the Apply button. This will use the specified report template name and date range to generate the report instead of the actual applied settings.
The "Generate Current" button uses the current report template name and highlighted report.
- E-mail: One of the best ways to get help is to send an e-mail to firstname.lastname@example.org (used to be email@example.com). Your message will automatically be forwarded to several engineers (I'm on the list) who can answer any questions you might have about QuickFinder Server.
- Product page: Click the QuickFinder logo, found on any search results page to jump to the QuickFinder Server product Web site.
- Documentation: The documentation for QuickFinder Server 4.0 is located at novell.com/documentation/qfserver40.
- Support: To see all QuickFinder-related TIDs, search for QuickFinder tids on novell.com.
We've asked one of our very capable graphics artists to create a QuickFinder Server-branded desktop background image for your workstations and OES servers. Just click on the thumbnail image at the right for the full-size (1600 x 1200) version.
Novell Cool Solutions (corporate web communities) are produced by WebWise Solutions. www.webwiseone.com