Combining Rsync, Quickfinder and NetStorage for Fun and Profit
Novell Cool Solutions: Feature
By Scott Flowers
Posted: 20 Oct 2005
This article describes a method that capitalizes on centralized disk-to-disk backups created with rsync to simplify search indexing with Quickfinder, and then delivers content from the search results via NetStorage. The method has two main benefits. Indexing the centralized backup data saves the bandwidth that would otherwise be needed to synchronize search indices between sites, and it reduces the number of Quickfinder installations needed to deliver centralized search over numerous file stores, which simplifies your network.
In one of my previous articles I discussed using rsync to back up production servers in a mixed network environment, including Linux, Windows and NetWare, onto a Linux server with a large disk. Our company now uses this method almost exclusively for backups. Read my previous article, Using Rsync for Flexible Powerful Backup and Self Serve Restore, for more information. Since that article, we have expanded our solution to back up about 20 servers, storing nearly 5 TB of data as daily incremental backups. Since Novell released OES on the Linux platform, we have also implemented NetStorage on our backup server to allow authorized users to perform their own restores.
How rsync ties into Quickfinder and NetStorage for production data will be explained a bit later, so stick with me!
We have all this data that our users dig into every day, but the sheer volume of it makes it increasingly difficult to capitalize on the work we've already done. We have the self-contradictory goal of keeping all our old stuff so we can re-use our knowledge. It is self-contradictory because the more existing knowledge you have as a company, the harder it is to find the exact tidbit of information you need from your archive.
In the paper world, the solution to this problem is fanatical organization of your paper records. We do that, and we've just finished a project to improve how we do that, in our transition from a small company to a medium-sized one. Having excruciatingly organized paper records helps lawyers find stuff when they subpoena your records, which in our business happens from time to time. However, it doesn't really help your workers when they need a small piece of design information for something they completed three years ago (or even three months or weeks ago), and they need it now. That's where search comes in.
Quickfinder is Novell's search engine. It is a pretty powerful but simple web search engine, and it integrates nicely with Novell's management tools. Quickfinder can index content on web servers hosted on NetWare and Linux via the filesystem directly, or it can index any web content over the network. Indices can be created on one Quickfinder server and then replicated to another, allowing you to distribute the load of indexing by placing the indexer close to the data while centralizing searching across many sites. We have used web search for our Intranet and Internet content for several years, but now we are implementing it for our production data shared storage locations, which are several orders of magnitude larger. Quickfinder helps with this because it understands all our primary data types (with the notable exception of AutoCAD drawings), its indexing operations are fast, the indices are compact, and the searching user experience is top-notch.
We primarily work on our production data on shared storage locations on NetWare and OES Linux servers, which users access via the Novell Client. This method is great for LANs where there is a lot of bandwidth and the endpoints are all under the control of the IT department. However, some users never come into an office, or they are located at a remote site office for an extended period, with no IT support and no IT resources besides a printer and an Internet connection. These users don't get the benefit of the full network experience, but they still need access to project data.
For that purpose, we have implemented NetStorage on our production fileservers, and provided access to those users via their web browsers. This gives us the advantage of providing file access using the same access control permissions that we use for locally connected users, without having to maintain the Novell Client on remote computers, or even requiring remote users to use one of the company's computers at all. We validate outside user identities with an extra layer of token-based security on top of the eDirectory credentials.
Since we have implemented NetStorage for remote file access, we have a ready-made solution to offer web-based search results for file searches to our users via NetStorage. Users can search with Quickfinder, and have the search results (when stored on a NetWare or OES Linux server) presented to them via NetStorage, ensuring the company's security policy for access to data is upheld.
Extremely Distributed Data
We operate in the second-biggest country in the world by geographical area. If a person were to drive nonstop between all our locations, it would take about 40 hours, yet we have only 11 locations. We are spread out all over the map, and that spread-out layout is mirrored in our office connectivity: we have vast virtual distances between our concentrated pockets of data.
When we originally discussed providing high-speed search for all our production office data, we planned to install Quickfinder on each file server where data was stored, have each server index its own data store, and then replicate the indices to a centralized search server where users could do a global search. This was a great way to minimize bandwidth and simplify things for our users. The indexing, which is fairly I/O intensive, could be done locally on each server, and the indices, which are much smaller than the full datasets, could be replicated over our WAN connections.
At the same time, we were coming to grips with the fact that our data was outgrowing our backup capabilities, and we needed to build a more capable backup system. For this purpose we implemented centralized servers with massive (for us) storage, and learned how to use rsync, as described in my previous article, to back up our data to disk. Each day, rsync transfers only what has changed at each remote site, which is very economical on bandwidth. For convenience, however, every day's backup appears to an administrator to be a complete backup of each server, with all files available in one place.
During a meeting of our Corporate Technology Services group, when we were discussing the progress of these and other projects, it occurred to us that since we had already started centralizing the backup of our data, we had local copies of all the production servers' data stores that were only one day older than current. This prompted the idea that we could index the data centrally, and avoid transferring even the indices over the WAN.
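One common way to get daily backups that transfer only changes yet each look like a full copy is rsync's --link-dest option, which hard-links unchanged files against an earlier snapshot. Whether the backups in this article are built exactly this way is an assumption, as are the dated directory layout, the host name and the rsync module name below. This minimal sketch only builds the command line and does not run rsync:

```python
from datetime import date, timedelta
from pathlib import PurePosixPath


def build_rsync_command(source: str, backup_root: PurePosixPath, day: date) -> list[str]:
    """Build an rsync command line for one daily snapshot.

    Assumption: snapshots live in dated directories under backup_root, and
    files unchanged since yesterday are hard-linked via --link-dest, so each
    directory looks like a full backup while only changed files are sent.
    """
    today_dir = backup_root / day.isoformat()
    yesterday_dir = backup_root / (day - timedelta(days=1)).isoformat()
    return [
        "rsync",
        "--archive",  # preserve permissions, times, symlinks and so on
        "--delete",   # remove files that no longer exist on the source
        f"--link-dest={yesterday_dir}",
        source,
        f"{today_dir}/",
    ]


if __name__ == "__main__":
    cmd = build_rsync_command(
        "fileserver.example.com::projects/",  # hypothetical rsync daemon module
        PurePosixPath("/data/backups/servername"),
        date(2005, 10, 20),
    )
    print(" ".join(cmd))
```

Keeping the command construction separate from its execution makes it easy to inspect what would run for each server before scheduling it.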
Centralized Search Presenting Localized Results
We wanted the users to be presented with the most current files when they selected a search result from our search system. This precluded us from serving the search results from our backup server. However, we had already configured NetStorage on our production data stores for remote users. This enabled us to use NetStorage to allow the search results to point at the actual production files, rather than the backup archive. Users get the benefit of extremely rapid search capabilities from any office, and they can access any data that they have rights to via NetStorage directly from the search results.
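The idea of indexing one location while serving results from another comes down to swapping the indexed root directory for a URL prefix. Quickfinder does this itself once the index is configured, so the sketch below is only an illustration of the mapping. The URL prefix is a placeholder rather than a real NetStorage path, and the percent-encoding of path segments is an assumption of this sketch, not documented Quickfinder behaviour:

```python
from pathlib import PurePosixPath
from urllib.parse import quote


def result_url(indexed_file: str, index_root: str, url_prefix: str) -> str:
    """Map a file under the indexed backup tree to a URL under url_prefix.

    Raises ValueError if indexed_file is not located under index_root.
    """
    relative = PurePosixPath(indexed_file).relative_to(index_root)
    # Percent-encode each path segment (spaces and similar), keep the slashes.
    encoded = "/".join(quote(part) for part in relative.parts)
    return url_prefix.rstrip("/") + "/" + encoded


if __name__ == "__main__":
    print(result_url(
        "/data/backups/servername/current/projects/2005/site plan.doc",
        "/data/backups/servername/current/projects",
        # Placeholder prefix: the real NetStorage URL depends on the version.
        "https://files.example.com/hypothetical-netstorage-path/",
    ))
```

Because only the part of the path below the indexed root survives the mapping, the backup tree and the production storage location need the same directory structure below that point.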
This paper is written based on NetStorage as it shipped with OES Linux SP1 and OES NetWare 6.5 SP3. It should work for any subsequent NetStorage, and it works for previous NetStorage versions, but the URLs that are used to access NetStorage storage locations may change depending on the version.
Nothing special is required for NetStorage configuration to work with this solution, with one caveat: For every NetStorage storage location that you wish to use for search results, you have to create one Quickfinder index. For this reason, it simplifies things to minimize the number of NetStorage storage locations. To make an area of disk on your server available for access via Quickfinder search results, simply create a storage location object in iManager, pointing to the root of the area on disk that you want to access. Then, assign the storage location to the groups who are to be granted access. Remember the storage location object name. In the examples we are using Drive P as the storage location name.
Rsync Backup Configuration
This paper will not go into detail on how to use rsync for disk-to-disk, over-the-wire backup. For the purposes of discussion, we will assume that there is an rsync-ed replica of the production server data stores located on the server that will be running Quickfinder. In my case, I back up my production data using rsync so that the current backup is stored in /data/backups/servername/current/ and the data I am interested in indexing is one directory lower, in a directory called projects.
First, Quickfinder must be installed on your OES Linux server. You can choose to install Quickfinder during the OES setup, or you can install it via YaST after the installation is finished. In either case you must run the Quickfinder configuration module, which is located under Network Services in YaST on OES Linux. If the Quickfinder packages are not already installed, running the module will cause them to be installed, and you may need the OES Linux installation CDs. After the packages are installed, the Quickfinder service will be configured, and Apache and Tomcat will be restarted.
Once Quickfinder is configured and running, the web-based management tool will be available. It is accessed by going to https://servername/qfsearch/AdminServlet. If your Quickfinder server is running Linux, it seems that the only user that can log in here is root. There may be a way to use a different user for this, so you may wish to consult the Quickfinder documentation. To get started, create a new file system index from the web administration tool. Give the index a user-friendly name, like Edmonton Projects. Enter the filesystem location you want the indexer to index. In my case, that is the projects directory under /data/backups/servername/current/.
The next setting is the URL prefix that the search server attaches to the beginning of each search result. This is what allows us to search the centralized backup archive but, in the search results, point the user at a completely different server, running NetStorage, that hosts the actual content. Assuming my server is called files.example.com, and the storage location that was configured earlier is called Drive P, the URL prefix to enter in Quickfinder is the NetStorage URL of that storage location on files.example.com.
Once all that is configured, save the index. Then, you can go back and edit the index to set a filter on all the file types you care about for your index, or configure any other administration settings. In my experience, if you are able to catalog all the file types you need to index, and then enter an Include filter in the index configuration page, indexing works much faster. Once the index is configured, you just have to go to the Manage tab and tell Quickfinder to generate your index. If you plan to update your data store regularly, you also need to configure scheduled updates for your indices. See the online help in Quickfinder for help with this.
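Cataloging the file types in a large data store by hand is tedious, so a small script can help decide what belongs in an Include filter. The sketch below is not part of Quickfinder. It simply walks a directory tree and counts files by extension; the function name is hypothetical.

```python
import os
from collections import Counter


def count_extensions(root: str) -> Counter:
    """Count files per lowercased extension under root.

    Files without an extension are counted under the empty string.
    """
    counts = Counter()
    for _dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            extension = os.path.splitext(name)[1].lower()
            counts[extension] += 1
    return counts
```

Calling count_extensions on the indexed directory and reading its most_common() list shows the most frequent file types first, which is a reasonable starting point for choosing the types to include.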
Hopefully this paper is helpful to anyone who needs to implement searching of a document store hosted on NetWare or OES Linux. The solution provides conservation of bandwidth, fast indexing of distributed content, and delivery of the most current documents resulting from searches via NetStorage. Oh, and I was kidding about the Profit part.