Using XML for Enterprise Email Discovery - Part 2
Novell Cool Solutions: Feature
By Messaging Architects
Posted: 24 Oct 2006
Demystifying XML: XML vs SQL for Enterprise Email Discovery - Part 2
By Greg Smith
Review Part 1 of the Series, Anatomy of an Email Record
What You Need to Know about XML Data
The Extensible Markup Language, or XML, is a standard published by the World Wide Web Consortium (W3C) and is quickly becoming the dominant format for describing content on the web. Originally designed to store data separately from its presentation, it is a meta-language that can be read by virtually any software application, including relational database management systems (RDBMS).
Simply put, a collection of XML records may almost be regarded as a database, where each XML-tagged item corresponds to an individual database record. An XML file defines specific fields for structured and unstructured data but is not restricted to a single fixed schema. It is well suited to storing heterogeneous document types and content that varies, e.g., text documents, images, audio files, and hyperlinks.
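To make the "collection of records" idea concrete, here is a minimal sketch in Python. The element names (`archive`, `email`, `from`, `subject`, `body`, `attachment`) are assumptions invented for illustration and do not reflect the format of any particular archiving product. The sketch shows that two records in the same file can carry different fields and still be read by the same code.

```python
import xml.etree.ElementTree as ET

# Hypothetical archive: all element and attribute names are illustrative only.
ARCHIVE_XML = """
<archive>
  <email id="1">
    <from>alice@example.com</from>
    <subject>Quarterly report</subject>
    <body>Please review the attached figures.</body>
    <attachment type="image/png" name="chart.png"/>
  </email>
  <email id="2">
    <from>bob@example.com</from>
    <subject>Lunch</subject>
    <body>Noon at the usual place?</body>
  </email>
</archive>
"""

def load_records(xml_text):
    """Turn each <email> element into a dict, keeping whatever fields it has."""
    root = ET.fromstring(xml_text.strip())
    records = []
    for email in root.findall("email"):
        record = {"id": email.get("id")}
        for child in email:
            if child.tag == "attachment":
                # A record may have zero, one, or many attachments.
                record.setdefault("attachments", []).append(child.get("name"))
            else:
                record[child.tag] = (child.text or "").strip()
        records.append(record)
    return records

records = load_records(ARCHIVE_XML)
for record in records:
    print(record)
```

The second record has no attachment, yet no schema change is needed to store or read it.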
XML is also well suited as the data source from which enterprise-class search engines retrieve relevant information almost instantly, based on natural language searches that do not require knowledge of a query language. Search engines differ in their details, but they all perform three basic tasks:
- Search the system (Internet, email system, corporate network, etc.) based on important words
- Keep an index of the words they find and where they find them
- Allow users to look for any combinations of words found in that index
In other words, search engines return results so quickly because of indexing, not because of the particular choice of data source.
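The three tasks listed above can be sketched as a small inverted index. This is an illustrative simplification, not the design of any specific search engine; the sample documents and the function names `tokenize`, `build_index`, and `search` are assumptions made for the example.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Split text into lowercase words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(documents):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return the ids of documents that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Hypothetical sample data.
documents = {
    1: "Quarterly report attached for review",
    2: "Lunch at noon?",
    3: "Please review the quarterly budget",
}
index = build_index(documents)
print(sorted(search(index, "quarterly review")))  # [1, 3]
```

A query touches only the index entries for its words, so lookup time does not depend on how or where the original documents are stored.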
One drawback of using XML as a database is that records and fields must be searched sequentially to discover information. Some may argue that this limitation could have a severe performance impact when storing millions of records. However, it is crucial to understand that XML is well suited to unstructured data because its tagged fields give a search engine clearly delimited content to index, and queries then run against that index rather than against the XML files themselves. This is the advantage search engines use. Otherwise, the XML data source is just that: a collection of records that represents a static copy of the archived email document, in a format that is open and accessible to a multitude of other applications.
It is the search engine that reads the data from the XML repository and federates the unstructured data contained there into a more human-friendly form that can be quickly and easily searched and displayed, regardless of data format. Most major database vendors, including Oracle and IBM, are incorporating XML as a data source into their products.
Enterprise and web-based search engines, such as Google and Yahoo, have shown their worth and scalability by providing individuals with access to billions of web-based documents. In addition, these search engines are now reaching into the enterprise, indexing structured and unstructured data from other applications, such as relational databases, web sites, desktops, electronic documents, and email systems.
Part of the lure of using enterprise search for email discovery is that search engines can use advanced search algorithms that their RDBMS cousins were never designed to perform. By using a variety of sophisticated search methods, such as automatic categorization of results, taxonomies, relevance scoring, result ranking, lemmatization, federated searches, and plain English queries, search engines can retrieve a highly accurate record set of the requested data.
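As one small illustration of result ranking, the sketch below orders messages by how often the query words occur in them. This raw term count is a deliberately crude stand-in chosen for the example; production search engines use considerably more sophisticated relevance scoring. The message ids and the `rank` function are hypothetical.

```python
import re
from collections import Counter

def tokenize(text):
    """Split text into lowercase words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def rank(documents, query):
    """Return (doc_id, score) pairs, best match first.

    The score is the total number of times any query word occurs in the
    document. Documents that match no query word are left out.
    """
    terms = set(tokenize(query))
    scored = []
    for doc_id, text in documents.items():
        counts = Counter(tokenize(text))
        score = sum(counts[term] for term in terms)
        if score > 0:
            scored.append((doc_id, score))
    # Highest score first; ties are broken by document id.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))

# Hypothetical sample data.
messages = {
    "msg-1": "Budget review: the budget for Q3 is attached",
    "msg-2": "Lunch on Friday?",
    "msg-3": "Can you review my slides",
}
ranked = rank(messages, "budget review")
print(ranked)  # [('msg-1', 3), ('msg-3', 1)]
```

A plain SQL `LIKE` match would return both messages without any ordering by relevance, which is the kind of gap the paragraph above describes.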
Another advantage of the enterprise search engine is scalability. Traditionally, relational databases obtained their performance by adding bigger and faster servers. Enterprise search engines emerged in the age of commoditized hardware, which allows for linear scaling. Instead of adding huge multi-processor computers, enterprise search allows expansion by simply adding inexpensive PC-class machines to the cluster, thus increasing the capacity and performance of the system as a whole. While enterprise search does impose a greater requirement in terms of index sizes, the cost of maintaining these indexes is significantly lower than the overhead incurred with huge RDBMS installations.
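The scale-out approach can be sketched by splitting one index across several small partitions (shards) and querying all of them. In the sketch below each shard is just an in-process object; in a real deployment each would be a separate machine, which is an assumption the code does not model. The class names `Shard` and `Cluster` and the modulo routing rule are hypothetical choices made for the example.

```python
from collections import defaultdict

class Shard:
    """One partition of the index; stands in for one inexpensive machine."""

    def __init__(self):
        self.index = defaultdict(set)

    def add(self, doc_id, text):
        for word in text.lower().split():
            self.index[word].add(doc_id)

    def search(self, word):
        return set(self.index.get(word.lower(), set()))

class Cluster:
    """Routes each document to one shard and sends each query to all shards."""

    def __init__(self, shard_count):
        self.shards = [Shard() for _ in range(shard_count)]

    def add(self, doc_id, text):
        # Simple routing rule (an assumption): document id modulo shard count.
        self.shards[doc_id % len(self.shards)].add(doc_id, text)

    def search(self, word):
        hits = set()
        for shard in self.shards:
            hits |= shard.search(word)
        return hits

cluster = Cluster(shard_count=3)
texts = ["budget review", "lunch plans", "budget approved", "review slides"]
for doc_id, text in enumerate(texts):
    cluster.add(doc_id, text)

print(sorted(cluster.search("budget")))  # [0, 2]
```

Each shard holds only a fraction of the documents, so adding shards reduces the work any single one has to do for a given collection size.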
About the Author
Greg Smith, MCNE and MCNI, has been working in the high-technology field for more than 15 years, predominantly with Novell Platinum integrators and resellers. Greg Smith is one of the main designers of Messaging Architects' GWArchive, the only GroupWise-native email retention solution included in the Gartner Magic Quadrant for active archiving.
In his current position as Director of Professional Services at Messaging Architects, he brings his networking and messaging expertise to a company that specializes in GroupWise enhancements and product development. Greg has been active in the area of public speaking, giving technical presentations at GroupWise Advisor Summits, as well as at Novell BrainShare.