Using DMOZ Open Directory Project lists with Novell BorderManager

Novell Cool Solutions: Feature
By Marcus Williamson

Posted: 11 Apr 2003
 

1.0 Introduction

The Novell AppNote "Using Public-Domain Site Blocking Lists with Novell BorderManager" (February 2003) described how it is possible to use publicly-available lists of sites, downloaded from the Internet, to regulate access to the web via Novell BorderManager.

That AppNote discussed the use of "Blacklists" - lists of sites which are not acceptable - and "Whitelists" - lists of sites which are acceptable - within a BorderManager environment, in conjunction with a third-party solution, Connectotel LinkWall.

Since the publication of the AppNote, many readers have asked for more detailed information on using DMOZ Open Directory Project (ODP) data to create "Whitelists". There has been particular interest from users in medical environments who wish to use the "Health" section of the DMOZ database as a "Whitelist". This would let administrators in hospitals permit access to health-related sites, plus a select few other sites, whilst denying access to all other sites.

This article aims to assist those who wish to use the DMOZ ODP data. It includes background information on DMOZ, how to obtain the DMOZ ODP data, how to produce an extract of the data and how to keep the list up-to-date automatically.

2.0 DMOZ

The DMOZ open directory project (ODP) is a public domain project to provide a human-edited index of the web. The DMOZ site at http://www.dmoz.org defines itself in the following way:

"The Open Directory follows in the footsteps of some of the most important editor/contributor projects of the 20th century. Just as the Oxford English Dictionary became the definitive word on words through the efforts of volunteers, the Open Directory follows in its footsteps to become the definitive catalog of the Web.

The Open Directory was founded in the spirit of the Open Source movement, and is the only major directory that is 100% free. There is not, nor will there ever be, a cost to submit a site to the directory, and/or to use the directory's data. The Open Directory data is made available for free to anyone who agrees to comply with our free use license."

More information about the project can be found here: http://www.dmoz.org/about.html

DMOZ allows file dumps of the DMOZ ODP database to be downloaded. The dumps contain details of web sites in a format known as Resource Description Framework (RDF), as described on this page: http://dmoz.org/help/getdata.html

The RDF data in the DMOZ dumps is serialized as XML (eXtensible Markup Language) and describes sites within a hierarchy of categories, as stored in the DMOZ ODP database. A category represents a section of the ODP database, relating to a group of similar sites, as defined by the DMOZ ODP editors. For example, the "Health" category within the DMOZ database contains about 67,000 entries and can be found displayed in a hierarchical index here: http://dmoz.org/Health/
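The category hierarchy can be illustrated with a minimal Python sketch. It assumes that categories are identified by slash-separated paths beginning with "Top/" (for example "Top/Health/Medicine"); the function name in_category and the sample paths are hypothetical and are used here only for illustration.

```python
def in_category(topic_path: str, category: str) -> bool:
    """Return True if topic_path is the category itself or lies beneath it.

    Assumes ODP-style paths such as "Top/Health/Medicine" (an assumption
    about the dump layout, not something stated in this article).
    """
    root = "Top/" + category.strip("/")
    # Compare on a path boundary so that "Health" does not match "Healthcare".
    return topic_path == root or topic_path.startswith(root + "/")


print(in_category("Top/Health/Medicine/Cardiology", "Health"))  # True
print(in_category("Top/Healthcare_Industry", "Health"))         # False
```

Matching on the path boundary, rather than on a plain string prefix, keeps a category from accidentally pulling in a differently named sibling.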

This page lists tools which can extract data from RDF files: http://dmoz.org/Computers/Internet/Searching/Directories/Open_Directory_Project/Use_of_ODP_Data/Upload_Tools/

Whilst many tools are available for extracting and manipulating the RDF data, until recently none was written specifically for the Novell NetWare/Novell BorderManager environment.

Connectotel has developed RDFCONV, a freeware NLM to automate the task of extracting category information from the DMOZ ODP file. More information about obtaining and using RDFCONV.NLM can be found in sections 3.2 and 4.2 of this document.

3.0 Software Installation and Configuration

To make use of the DMOZ ODP list, you will need the following software:

  • Novell NetWare 4.x, 5.x or 6.x - the network operating system
  • Novell BorderManager or Connectotel Proxy Engine - the proxy server
  • Connectotel LinkWall - the site blocking software
  • Connectotel RDFCONV - the automatic extraction tool

You will also need a copy of the DMOZ ODP list, downloaded from the DMOZ site.

3.1 LinkWall

The LinkWall software can be downloaded from: http://www.connectotel.com/linkwall

Install the LinkWall software by following the instructions included in the documentation file LINKWALL.DOC.

Please ensure that the software is correctly configured for either BorderManager 3.5/3.6, or BorderManager 3.7, as described in the LinkWall documentation. Appendix B of the LinkWall documentation describes the correct contents for the files SYS:ETC\LINKWALL\LINKWALL.ACL and SYS:SYSTEM\LINKWALL.NCF.

Test the configuration by visiting the page http://www.connectotel.com/linktest as described in section 4.0 of the LinkWall documentation.

3.2 RDFCONV

The RDFCONV NLM software can be downloaded from: http://www.connectotel.com/netware/

Install the RDFCONV software by copying RDFCONV.NLM to the directory SYS:SYSTEM on the BorderManager server.

4.0 Obtaining and using the DMOZ ODP file

4.1 Obtaining the DMOZ ODP file

To obtain the DMOZ ODP file, use the following procedure:

  1. Download the DMOZ content file, content.rdf.u8.gz, from the following URL:
    http://rdf.dmoz.org
    (This file can be downloaded at the desktop from within your browser, or at the server using the Connectotel HTTPGET.NLM tool)
  2. Then extract the content.rdf.u8.gz file (gzip format), using a tool such as WinZip, to produce the file:
    content.rdf.u8
    (Extraction of the GZ file can be performed at the server using the Connectotel UNCOMPR.NLM tool)
  3. Run the RDFCONV.NLM as shown in the following section.
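For readers who prefer to script the extraction step at the desktop, the following Python sketch shows one way to decompress the downloaded archive. It is an illustration only, not part of the Connectotel tools; the function name gunzip is hypothetical, and the sketch simply writes the result under the archive's name minus the .gz suffix.

```python
import gzip
import shutil
from pathlib import Path


def gunzip(src: str) -> Path:
    """Decompress a .gz file next to itself and return the new path."""
    source = Path(src)
    if source.suffix != ".gz":
        raise ValueError(f"expected a .gz file, got {source.name}")
    # Dropping the final suffix turns content.rdf.u8.gz into content.rdf.u8.
    target = source.with_suffix("")
    with gzip.open(source, "rb") as compressed, open(target, "wb") as out:
        # Copy in chunks so a large dump is never held in memory at once.
        shutil.copyfileobj(compressed, out)
    return target
```

Calling gunzip with the path of the downloaded content.rdf.u8.gz file would leave the decompressed file alongside it, ready to be copied to the server.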

4.2 Running RDFCONV

RDFCONV.NLM is a tool developed by Connectotel for extracting data from the DMOZ ODP file. RDFCONV.NLM is run from the server console in the following format:
rdfconv inputfilename outputfilename category [switches]

For example:
rdfconv sys:dmoz\content.rdf.u8 sys:dmoz\health.txt Health /v

This example would extract the category "Health" from the file content.rdf.u8 and place it into a file named health.txt, running in verbose mode.

The available command line switches for use with RDFCONV.NLM are:

/v verbose mode - displays information about the running of the program

/s also prints the individual URLs as they are found

/a automatically closes the screen created when using /v mode

/c adds comments to the output file marking the beginning and end of each sub-category
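To give a feel for what a category extraction involves, here is a minimal Python sketch. It is not RDFCONV's actual algorithm, which is not documented in this article. It assumes that each site in the dump appears as an ExternalPage element carrying the URL in an about attribute, followed by a topic element holding the category path; those element names, the function name extract_urls and the sample data are all assumptions made for illustration.

```python
import re
from typing import Iterable, Iterator

# Assumed layout of the dump (not confirmed by this article):
#   <ExternalPage about="http://...">
#     <topic>Top/Health/...</topic>
#   </ExternalPage>
PAGE_RE = re.compile(r'<ExternalPage about="([^"]*)"')
TOPIC_RE = re.compile(r"<topic>([^<]*)</topic>")


def extract_urls(lines: Iterable[str], category: str) -> Iterator[str]:
    """Yield the URL of every page filed under the given category."""
    root = "Top/" + category
    url = None
    for line in lines:  # one line at a time, so the dump is never fully loaded
        page = PAGE_RE.search(line)
        if page:
            url = page.group(1)
            continue
        topic = TOPIC_RE.search(line)
        if topic and url is not None:
            path = topic.group(1)
            if path == root or path.startswith(root + "/"):
                yield url
            url = None  # wait for the next ExternalPage element


# Hypothetical sample data, for illustration only.
sample = [
    '<ExternalPage about="http://example.org/heart/">',
    "  <d:Title>Heart</d:Title>",
    "  <topic>Top/Health/Medicine</topic>",
    "</ExternalPage>",
    '<ExternalPage about="http://example.com/games/">',
    "  <topic>Top/Games</topic>",
    "</ExternalPage>",
]
print(list(extract_urls(sample, "Health")))  # ['http://example.org/heart/']
```

Reading the file line by line matters because the full dump is large; a parser that loads the whole document into memory would be a poor fit for it.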

4.3 Editing LINKWALL.LST

Edit the file SYS:ETC\LINKWALL\LINKWALL.LST to include the names of the groups which you wish to deny or allow.

For example, this line would include the "Health" file created above:
$include sys:dmoz\health.txt

5.0 Implementation

5.1 Running LinkWall

The final stage is to activate the LinkWall software on the NetWare server using the LINKWALL.NCF file. To do this, type:

LINKWALL

at the file server console. This will run the file LINKWALL.NCF and display a screen indicating that LinkWall has been loaded and showing how many URLs have been read.

5.2 Allow or Deny mode?

By default, LinkWall loads in "Deny" mode, which means that it will block any sites found in the LINKWALL.LST file. If you wish to use LinkWall to manage a "Whitelist" then LinkWall should be run in "Allow" mode. In this case SYS:SYSTEM\LINKWALL.NCF should contain:

load sys:\etc\linkwall\linkwall /allow /version=1

This specifies "allow" mode and version 1 (BorderManager 3.5 or 3.6).

or

load sys:\etc\linkwall\linkwall /allow /version=2

This specifies "allow" mode and version 2 (BorderManager 3.7).
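The difference between the two modes can be summarized in a few lines of Python. This is a sketch of the decision logic only: it assumes exact, case-insensitive host name matching, which is a simplification, since the matching rules LinkWall actually applies are described in its own documentation rather than here. The function name is_permitted and the host names are hypothetical.

```python
def is_permitted(host: str, listed_hosts: set, allow_mode: bool) -> bool:
    """Decide whether a request for host should be let through.

    In "Deny" mode (the default) a listed site is blocked.
    In "Allow" mode only a listed site is let through.
    """
    listed = host.lower() in listed_hosts
    return listed if allow_mode else not listed


whitelist = {"health.example.org"}  # hypothetical list contents
print(is_permitted("health.example.org", whitelist, allow_mode=True))   # True
print(is_permitted("games.example.com", whitelist, allow_mode=True))    # False
print(is_permitted("health.example.org", whitelist, allow_mode=False))  # False
```

The same list therefore has the opposite effect depending on the mode, which is why a list built from the "Health" category needs LinkWall to be loaded with /allow.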

5.3 Staying Current

Any site blocking solution will only work well for as long as the site list files are kept up-to-date. In this respect, site blocking files can be compared to virus signature files, which must also be kept regularly updated.

The example below shows how to maintain the DMOZ ODP list file. Similar techniques can be used for maintaining any other public-domain site-list. An example using the squidGuard Blacklist can be found in the Novell AppNote "Using Public-Domain Site Blocking Lists with Novell BorderManager", February 2003.

5.3.1 Manual

Repeat the procedure shown in section 4.0 above to download and uncompress the DMOZ ODP file and to run RDFCONV, then unload and reload LinkWall so that it reads the updated list.

5.3.2 Automatic

The procedure in section 4.0 above can be automated by using server-based tools. An NCF (NetWare command file) reproduced in Appendix A below shows the use of commands including:

HTTPGET - to retrieve a file from a site using HTTP
Available free from Connectotel's site at http://www.connectotel.com/netware

DELAY - to wait between commands in the command file
Included with Novell NetWare

UNCOMPR - to extract the contents of a GZ file
Available free from Connectotel's site at http://www.connectotel.com/netware

The ZLIB zip/unzip library, used by UNCOMPR, is included with Novell NetWare.

The GETDMOZ.NCF file shown in Appendix A will perform the following actions in sequence:

  1. Download the latest content.rdf.u8.gz file to sys:dmoz
  2. Extract content.rdf.u8.gz to produce sys:dmoz\content.rdf.u8
  3. Run RDFCONV.NLM to extract the required categories (in this example, just the Health category)
  4. Unload the LinkWall software
  5. Reload the LinkWall software

6.0 Further Reading

For further information please see the Novell AppNote "Using Public-Domain Site Blocking Lists with Novell BorderManager" (February 2003).

Appendix A - GETDMOZ.NCF

rem
rem Load ZLIB library
rem
load zlib
rem
rem Get DMOZ ODP RDF file
httpget dirt03.netscape.com /rdf/content.rdf.u8.gz sys:dmoz\content.rdf.u8.gz
rem
rem Wait for 10 minutes to complete the download
delay 600
rem
rem Uncompress GZ file
uncompr sys:dmoz\content.rdf.u8.gz
rem
rem Wait for 5 minutes to complete the extraction
delay 300
rem
rem Extract the Health section from the RDF file
rdfconv sys:dmoz\content.rdf.u8 sys:dmoz\health.txt Health
rem
rem Wait for 5 minutes to complete the extraction
delay 300
rem
rem Unload LinkWall so that it can reread the updated list
unload linkwall
rem
rem Wait for 10 seconds
delay 10
rem
rem Run LINKWALL.NCF
linkwall
echo Finished!

A Connectotel White Paper
April 2003
Copyright © 2002-2003 - Connectotel Ltd
http://www.connectotel.com/

