Searching and Indexing


Table of Contents

Introduction

Design Space

Systems


Introduction

To locate information in the WWW by any means other than naive hyperlink traversal requires the use of an index or some sort of query capability. These two topics are closely related, since an index can be thought of as a cached query, and indeed, many indices are created by using web crawlers that have much in common with query engines (e.g., similar traversal strategies and building of temporary indices). Further, some indices allow queries to be applied to their contents, while others simply are organized collections of hyperlinks.

Regardless of style, locating information in a document space (in this case the WWW) ultimately partitions the document space into two parts: documents retrieved and documents not retrieved. Each of these parts may contain both relevant and irrelevant documents and all shades of partial relevance in between. Precision measures how well the retrieved documents meet the needs of the user, i.e., how few irrelevant documents were retrieved. Recall measures how many of the relevant documents were actually retrieved. To improve precision, retrieved documents are often ranked by a presumed measure of relevance. Accurately measuring precision and recall requires omniscient knowledge of all document contents and of the user's actual (not necessarily stated) needs; needless to say, actual systems can only estimate them. The aim of any information retrieval system (web-based or otherwise) is to maximize precision and recall at reasonable cost, while minimizing complexity for the user.
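
To make the measures concrete, here is a minimal sketch (not from the survey itself) of how precision and recall would be computed if the sets of relevant and retrieved documents were somehow known:

    def precision_recall(retrieved, relevant):
        """retrieved, relevant: sets of document identifiers."""
        hits = retrieved & relevant                      # relevant documents actually retrieved
        precision = len(hits) / len(retrieved) if retrieved else 0.0
        recall = len(hits) / len(relevant) if relevant else 0.0
        return precision, recall

    # Example: 3 of 4 retrieved documents are relevant (precision 0.75),
    # but only 3 of 6 relevant documents were found (recall 0.5).
    print(precision_recall({"d1", "d2", "d3", "d4"},
                           {"d1", "d2", "d3", "d5", "d6", "d7"}))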

These goals are, of course, mutually contradictory: broadening a search to improve recall generally admits more irrelevant documents and lowers precision, and improving both at once raises cost and user-visible complexity.

As such, systems generally tend to emphasize some goals more than others. As a trivial example, the list of references for a journal article (hopefully) has high precision for relevant related articles because it is hand constructed by an expert, will naturally not include many other relevant pieces of information (low recall), is expensive to build (for its size) because a human is involved in creating it, and has low complexity. Other kinds of mechanisms fall elsewhere in the design space.

For a general overview of indexing issues, see InfoSeek: Searching the Web. For a nice glossary of WWW terms, see NCSA Glossary.


Design Space

Tools for locating information can be classified by several different criteria:

Information Space & Query Models

This refers to how the tool and the user model and manipulate the information space. There are three considerations: how the tool presents information to the user, how information is represented inside the tool, and whether these two views are the same.

Tools are generally acted upon by the user either as an index to be browsed or as an interface through which queries (not necessarily SQL-like) may be formed and submitted. Tools such as Yahoo that present as an index invariably maintain internal indices. If a query interface is presented, an internal index may or may not be maintained; however, tools that manage large systems almost always create indices for internal use to reduce the cost of query evaluation. The internal and external views are not necessarily the same: Yahoo both stores and (primarily) presents as indices, while Harvest and HyPursuit construct and maintain indices for efficiency but present a query interface to the user.

An interesting alternative to subject-based searches is the "social" search [Erickson, CACM, 1/96, pg. 15], in which searching is based on profiles (stored or known) of "who would be likely to know about X" that are used as starting points for traversals. This is mechanically very similar to the use of subject indices, the difference being that non-documents or untraditional documents (personal pages) are used instead of subject indices.

How Information Becomes Organized

Querying or browsing implies that at some point the information space gets partitioned. This can be done in several different ways.

The information space may be organized manually by humans; this corresponds to edited collections and is generally confined to relatively small, cohesive collections within a larger information space. An interesting point made at a recent DTII meeting is to not underestimate the value of editors as a form of "information retailer" that repackages "wholesale" information available in digital libraries or on the WWW.

Alternatively, the information space may be automatically clustered using a variety of techniques. Most common are clustering by keywords, by term occurrences in the text, and by hyperlink graph structure; combinations of these criteria are sometimes used. The clusters may then be indexed based on some function of the clustering criteria (e.g., most common terms).
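
As a purely hypothetical illustration of term-based clustering (it does not describe any particular engine), documents might be grouped by the overlap of their most frequent terms, with each cluster indexed by the terms it accumulates:

    from collections import Counter

    def top_terms(text, k=10):
        """The k most frequent words, a crude stand-in for the document's key terms."""
        return {t for t, _ in Counter(text.lower().split()).most_common(k)}

    def cluster(docs, threshold=0.3):
        """docs: dict mapping document name -> text.  Returns (index terms, members) pairs."""
        clusters = []                                    # each cluster is [term set, member names]
        for name, text in docs.items():
            terms = top_terms(text)
            for cterms, members in clusters:
                overlap = len(terms & cterms) / len(terms | cterms)
                if overlap >= threshold:                 # similar enough: join the cluster
                    members.append(name)
                    cterms.update(terms)                 # the cluster's index terms grow with it
                    break
            else:                                        # no similar cluster: start a new one
                clusters.append([terms, [name]])
        return [(sorted(cterms), members) for cterms, members in clusters]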

Finally, there are hybrid approaches that use web crawlers to find documents fitting certain coarse groupings as the basis for an index that is then refined by a human. For example, a crawler could locate and sort documents that contained any of a collection of medical terms; the (presumably relevant) documents would then be reviewed to determine if and how they should fit into a medical index. Such indices can be quite large (e.g., Yahoo), but are labor-intensive and take longer to populate/maintain than fully automatic indices. On the other hand, they are more likely to have high precision and are capable (in principle) of capturing nuances in the documents that cannot be detected automatically with current technology.

How the Indexer Finds Documents to Index

Regardless of whether documents are indexed by a human or automatically, the documents must first be found somehow. For large indices, this is the job of a web robot (agent). The behavior of the robot has an enormous impact on the coverage of the resultant index, its currency, and whether the system is perceived as a good "Internet citizen".

Robots need to be designed with several main design principles in mind. Generally, web robots start with a collection of known documents and traverse links from these documents to find servers with new documents to index. These documents are then retrieved to the indexer for processing. The main concerns in web robot design are not to swamp network bandwidth, not to swamp individual servers, and to respect the privacy of servers. Bandwidth is conserved in several ways: not retrieving documents multiple times, retrieving only indexable document types (e.g., don't retrieve .gif files if you don't index pictures), retrieving during off-hours, and minimizing the distance documents must travel (distributed indexers do this by processing at or near the server). Swamping a server is avoided by breadth-first searches (so as not to retrieve all documents from a single server in sequence), by not retrieving a document multiple times, and by re-indexing on an appropriate schedule that does the most popular (and therefore probably most capable) sites first. The privacy of a server is respected by allowing the server to place itself (partially) off-limits to the robot and by providing a feedback channel.
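
A minimal sketch of these politeness rules follows. It assumes a hypothetical helper, fetch_and_extract_links(), that retrieves a page and returns the URLs it links to; real robots are considerably more elaborate.

    import time
    from collections import deque
    from urllib.parse import urlparse, urljoin
    from urllib.robotparser import RobotFileParser

    INDEXABLE = (".html", ".htm", ".txt", "/")           # skip images etc. that we cannot index

    def crawl(roots, fetch_and_extract_links, agent="surveybot", delay=5.0):
        queue, seen, robots, last_hit = deque(roots), set(roots), {}, {}
        while queue:
            url = queue.popleft()                        # breadth-first: spreads load across servers
            host = urlparse(url).netloc
            rp = robots.get(host)
            if rp is None:                               # honor the server's robots exclusion file
                rp = RobotFileParser(f"http://{host}/robots.txt")
                try:
                    rp.read()
                except OSError:
                    pass                                 # unreadable robots.txt: can_fetch() then skips the host
                robots[host] = rp
            if not rp.can_fetch(agent, url):
                continue
            wait = delay - (time.time() - last_hit.get(host, 0))
            if wait > 0:
                time.sleep(wait)                         # do not swamp a single server
            last_hit[host] = time.time()
            for link in fetch_and_extract_links(url):
                link = urljoin(url, link)
                if link not in seen and link.lower().endswith(INDEXABLE):
                    seen.add(link)                       # never retrieve the same document twice
                    queue.append(link)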

Views

Views are static, periodically re-evaluated queries that serve to provide contexts against which more specific queries can be posed. For example, a view could be used to create an index of only "computer science" documents, which could then be searched for "channels". Views are usually created at servers (i.e., users don't create them). Views help to disambiguate vocabulary when terms have different meanings in different contexts. Thus, in addition to speeding up retrieval, the use of specialized views tends to make search results more precise.
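
A toy sketch of the idea, assuming an in-memory index that maps documents to term sets (it is not any particular system's implementation):

    class View:
        """A stored, periodically re-evaluated query; later queries run against its result."""

        def __init__(self, index, predicate):
            self.index, self.predicate = index, predicate    # index: dict of doc -> set of terms
            self.refresh()

        def refresh(self):
            """Re-evaluate the view as the underlying index changes (e.g., on a schedule)."""
            self.docs = {d: t for d, t in self.index.items() if self.predicate(t)}

        def search(self, term):
            """Search only within the (smaller, more homogeneous) view."""
            return [d for d, t in self.docs.items() if term in t]

    # e.g., restrict searches for the ambiguous term "channels" to computer science documents:
    # cs_view = View(index, lambda terms: "computer science" in terms)
    # cs_view.search("channels")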

Query Refinement & Querying Queries

A problem common to all information retrieval systems, whether Web-based or not, is phrasing a query that returns the right amount of the desired information. This means both getting the query predicate and search target correct and getting the right precision and recall given the user's desire at that time. While a user certainly can keep submitting new queries until a more or less "right" result is returned, the more usual way of doing things is to have the search engine provide some mechanisms for tuning the query or re-querying previously obtained results. This has benefits in that it is easier for the user to modify an existing query than to create a new one, and because it possibly reduces the computation required to evaluate subsequent queries.

We first examine several techniques to present query refinement to the user, and then examine how these are supported inside the search engine and look at the potential performance implications of query refinement.

Presentation

Several different approaches have been taken to query improvement in WWW search engines. A given engine normally supports more than one of these approaches, since in general they have different objectives.

Internals and Performance

It is often cheaper for a query engine to process a refined query than to start over with a presumably new query. This is one advantage to supporting query modification rather than letting the user do it on an ad hoc basis.
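
A hypothetical sketch of why refinement can be cheaper: a narrowed query can be evaluated against the cached result set of the previous query rather than against the whole index.

    def run(index, query_terms):
        """Full evaluation: every document in the index is examined."""
        return {doc for doc, doc_terms in index.items() if query_terms <= doc_terms}

    def refine(previous_results, index, extra_terms):
        """Refinement: only the documents retrieved by the previous query are re-examined."""
        return {doc for doc in previous_results if extra_terms <= index[doc]}

    # first = run(index, {"search"})               # scans the whole index
    # better = refine(first, index, {"engine"})    # touches only the earlier result set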

Centralized vs. Distributed

Indices can be maintained and searches performed either at a centralized location by a single engine, or at distributed locations by multiple engines. In either case, indices and engines can be replicated to provide greater throughput; replication does not change the basic premise that in one case all information is available at a single site whereas in the other it is not. In centralized engines, indices are built by doing a (generally breadth-first) retrieval of documents reached from some collection of initial root documents. Retrieved documents are brought to the server, where summary information (terms and links) is extracted. Needless to say, this is expensive in terms of bandwidth. Links extracted from documents are used for further searches; pruning may occur to prevent documents from being retrieved multiple times or to trim links that emanate from "uninteresting" documents.

The general strategy in a distributed engine is that leaf servers index (or search) documents as if each were a single, centralized server. Summaries of these indices are reported up to higher level indices until a top level index is reached. The contents of the summary and the way in which it is determined are an important point of variation among distributed engines. The major variants are to report all terms sufficiently important to be included in the local index, or to cluster the indexed documents and report the cluster summaries or cluster centroids (HyPursuit). In the latter case, a cluster is treated by the higher level server as a "document" to be indexed or clustered in turn, making the whole process recursive. Note that distributed engines are not necessarily hierarchical, since there is no inherent need for a strict partitioning of the indices; in fact, DAGs are a more common configuration.

Distributed engines all have some mechanism for routing queries or index probes to the "correct" lower level server. The index summaries are used to decide which lower level server may contain information relevant to a query. Pruning away contacts with lower level servers that are likely to be irrelevant to the query at hand is important for performance, and the strategies by which this is done differentiate between servers.
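
The following sketch illustrates the general idea of summary-based routing under simplifying assumptions (term-set summaries, exact term matching); it is not the protocol of any specific engine.

    from collections import Counter

    def summarize(local_index, keep=100):
        """A leaf server reports upward only the terms important enough for its local index."""
        counts = Counter(t for terms in local_index.values() for t in terms)
        return {t for t, _ in counts.most_common(keep)}

    def route(query_terms, children):
        """children: dict of server name -> reported term summary.
        Forward the query only to servers whose summary could possibly match."""
        return [name for name, summary in children.items() if query_terms & summary]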

Distributed engines such as Harvest and HyPursuit have a number of advantages over centralized engines: individually and collectively, these advantages mean that distributed engines scale much better than centralized engines, while at the same time better preserving local autonomy. Interestingly, although the principal advantage of distributed systems is their scalability, in practice centralized indices still index document collections that are 10-100 times larger.

Completeness and Currency

Unlike traditional databases, which comprise a closed world in which a null answer means the answer is "no", the WWW comprises an open world in which a null answer may mean only that the object could not be reached. As a result, all searches in the WWW should be treated as incomplete. One way to evaluate a search mechanism is by how close it comes to complete coverage of the information space. This is different from whether the index supports sophisticated searching techniques that allow users to phrase the right query for what they really want; the issue here is whether the search engine can reach a particular document at all, and whether, as part of its internal categorization process, it abstracts away information in the interest of efficiency that subsequently makes a document unreachable or a term unusable. A document can be unreachable for several reasons: the robot may never encounter a link to it, its server may be off-limits or unavailable, or the summarization process may have discarded the terms needed to find it.

Note that private document spaces such as corporate document archives can be closed in this sense.

Openness and Interoperability

Ideally, search engines should be able to make use of indices, clusterings, and documents retrieved by other search engines. Likewise, a query engine should be able to use multiple catalogs in processing a query. Distributed systems appear to be more open to this; Harvest and HyPursuit both allow a higher level indexer to use content summaries created by any indexing mechanism, as long as the summary is provided in a standard format. Unfortunately, the two systems seem to use different summary formats. The Stanford Digital Libraries project is building a common interface to several query engines. Interoperability work is all in its infancy.


Systems

The following indexers and search engines were found.

Alta Vista

This project was created at Digital Equipment Corporation and offers compact or detailed searches through a Web Index. DEC claims that this Web Index is the largest in existence. It also provides a full-text index of more than 13,000 newsgroups.

Deja News Research Service

This is a Usenet news searching service which provides access to what it claims is the world's largest Usenet news archive. Several options are available so that the user can tailor the search; for example, a "create a query filter" option can narrow the search to a particular newsgroup, date, or author.

Excite

Excite by Architext searches a database of Web pages and claims to be smarter than Yahoo. The company claims that the database contains more than 1.5 million web documents and 40,000-plus Web sites. It also allows the user to browse through the past two weeks of Usenet news and classified ads. In addition it provides NetReviews - reviews of Internet resources and web sites.

Harvest

Harvest is a hierarchical, and to some extent, open search engine. There are two main objectives to Harvest: to radically reduce CPU and network bandwidth costs associated with creating and searching indices; and to support a variety of indexing strategies. Harvest does this via a hierarchical organization, moving computation close to the data servers, replication, caching, and compression. While important for performance, caching, replication, and compression seem fairly standard (although not for indexers).

"Gatherers" at data servers collect information about the content of the individual data servers. This allows individual document accesses to be local calls rather than network retrievals of entire documents, as is customary in centralized engines. Summary information is compressed before shipment to higher level engines. In combination, this makes transmission of the summary information for a data server cheaper than the typical cost of retrieving a single document. Measured results indicate a better than 6,000:1 reduction in combined bandwidth, CPU, and disk requirements for indexing a "typical" Archie site. Gatherers can index remote data if a site will not allow a Gatherer to be placed locally, but then the performance improvements are much smaller. Index summaries are much smaller than the full-text indices used in, e.g., WAIS, but perform nearly as well; the claim is nearly the same precision, 70% of the recall, and only 3-11% of the index size. The option of full-text indexing exists if needed. The results of gathering can be shared by multiple higher level indexers called Brokers, which means that a site need only be indexed once in support of potentially many higher level indices.

Index creation can be customized. Examples given include hand-built indices and indexers that unpack nested structures such as compressed tars to extract function names from program source. Two specialized indexers are supported: Glimpse, which supports content queries, Boolean expressions, regular expressions, and approximate matching while requiring only small indices; and Nebula, which supports views. Views are important to the Harvest search strategy because they make possible hierarchical classification schemes in which Brokers index particular topic areas. This can be used for browsing-style search a la Yahoo, or to constrain searches when performing queries.

A number of document types have been indexed; see Bowman et al., Harvest: A Scalable, Customizable, Discovery and Access System. Harvest provides an HTML forms-based interface that looks much like that of other search engines.

HyPursuit

HyPursuit is a hierarchical search engine developed at MIT. As of 1/96, a collection of 42 servers indexed 100 leaf WWW sites. The principal features of HyPursuit are space-efficient indexing based on adjustable document clusters created by a bottom-up clustering technique that uses both document terms and hyperlinks, pruning of the search space, and feedback mechanisms for query improvement.

Leaf indexers create document clusters for the documents they can access; these documents are relatively local, to minimize network traffic. Clustering is based on a "document similarity function" that uses both document terms and hyperlinks; the similarity function is exposed and can be changed. Both terms and hyperlinks are used because both carry information content, which purportedly makes the resulting automatically generated clusters closer to human-generated clusters (which rely on document semantics) than clusters produced by other techniques. An experiment clustering CNN pages using the HyPursuit hybrid algorithm bears this out. The number of clusters at a server is adjusted by specifying to HyPursuit the amount of storage space to be allocated to indices and the desired cluster cohesion. Higher level servers cluster in the same way, treating cluster centroids reported from lower level servers as "documents" to be clustered. At each level, less relevant terms in the summaries are dropped to match resource availability.
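
HyPursuit's actual similarity function is not reproduced here; the sketch below only illustrates the general idea of weighting term overlap against hyperlink overlap, with the weight alpha and the Jaccard measure chosen arbitrarily for illustration.

    from collections import namedtuple

    Doc = namedtuple("Doc", ["terms", "links"])      # terms and links are sets

    def jaccard(x, y):
        return len(x & y) / len(x | y) if (x or y) else 0.0

    def similarity(a, b, alpha=0.5):
        """Weighted combination of term overlap and hyperlink-neighborhood overlap."""
        return alpha * jaccard(a.terms, b.terms) + (1 - alpha) * jaccard(a.links, b.links)

    # Documents that cite the same pages but share few words can still come out similar:
    # similarity(Doc({"markets"}, {"u1", "u2"}), Doc({"stocks"}, {"u1", "u2"}))  ->  0.5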

Searches can begin at any level in the server hierarchy. A search will identify a collection of potentially relevant clusters. The search will then be directed at the servers that manage documents or sub-clusters of the identified clusters. This proceeds until leaf servers are reached, at which point documents are identified. At each level, more information about the clusters becomes known. As a result, pruning is less aggressive at the top levels, since it is not clear at that point that a cluster is not relevant.

HyPursuit uses a number of clever techniques to refine queries, including suggesting terms that broaden or narrow a query or that are collocated with (in the same cluster as) the query terms. In addition, summaries (called "content labels") can be browsed.

Infoseek Guide

Not reviewed.

Infoseek Professional

Not reviewed.

Lycos

Lycos supports keyword searching for WWW pages from a (mostly) automatically maintained catalog database. Unlike Yahoo, Lycos presents a flat search space rather than a hierarchy; thus, to search, you do not need to figure out how a page might have been categorized. Queries consist of keywords that are used to determine a numerical rating for how well each indexed document matches the query. The way keywords are used in computing the rating can be modified using a number of pull-down menus, which allow ANDing, ORing, m-of-n matching, and setting the tightness of match desired. Arbitrary Boolean queries are not supported because searches are posed against the index rather than the documents, so negation is impossible to compute efficiently. However, an individual keyword can be negated by preceding it with a "-", which causes pages containing that word to be given a lower rating.
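
A hypothetical scoring sketch (not Lycos's actual formula) shows how a negated keyword can lower a document's rating without eliminating the document outright:

    def rate(doc_terms, query):
        """doc_terms: set of words on the page; query: space-separated keywords, '-' negates."""
        score = 0.0
        for kw in query.split():
            if kw.startswith("-"):
                score -= 1.0 if kw[1:] in doc_terms else 0.0     # lowers, but does not exclude
            else:
                score += 1.0 if kw in doc_terms else 0.0
        return score

    # rate({"search", "engine", "spam"}, "search engine -spam")  ->  1.0
    # The page is still returned, just ranked below pages that do not contain "spam".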

Lycos manages an index database that the company claims has millions of link descriptors and documents, built by a Web crawler that brings in thousands of documents daily. The indices are created and maintained using the SCOUT indexer. Lycos keeps a finite model of the web in which each server is the root of a tree of files. Root servers are located by submission from users via a menu on the Lycos page. From each server, a crawler retrieves the pages owned by that server. From each retrieved page, summary information is collected, including: title, headers, the 100 most important words, the first 20 lines, the size of the document in bytes, the number of words, and the number of references to the page from other pages. Incoming references are used by a pseudo-random search heuristic in which the spiders look at the most popular sites first, so that catalog entries for them are most current. In addition to document information, Lycos creates summary information about sets of documents (from a server) that is used in the search process. Server owners can place parts of a server's pages off-limits to Lycos by entries in a standard file that the crawler checks before retrieving pages. Lycos searches http, ftp, and gopher sites; it ignores telnet, mailto, and WAIS. Lycos is written in Perl, but uses a C program based on CERN's libwww to fetch URLs.
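
A rough sketch of the kind of per-page summary described above, using word frequency as a crude stand-in for "importance" (the survey does not say how SCOUT actually ranks words):

    import re
    from collections import Counter

    def summarize_page(html, url):
        """Extract the kinds of fields listed above from a fetched page."""
        text = re.sub(r"<[^>]+>", " ", html)                          # crude tag stripping
        words = [w.lower() for w in text.split()]
        title = re.search(r"<title>(.*?)</title>", html, re.I | re.S)
        headers = re.findall(r"<h[1-6][^>]*>(.*?)</h[1-6]>", html, re.I | re.S)
        return {
            "url": url,
            "title": title.group(1).strip() if title else "",
            "headers": [h.strip() for h in headers],
            "important_words": [w for w, _ in Counter(words).most_common(100)],
            "first_lines": text.splitlines()[:20],
            "size_bytes": len(html.encode("utf-8")),
            "word_count": len(words),
        }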

See: http://www.lycos.com/reference/slides.html and http://www.lycos.com/reference/faq.html

Magellan - McKinley's Internet Directory

Magellan is an online directory of reviewed and rated Internet sites. The most important features of Magellan are that articles are reviewed and rated by McKinley, and that its query interface supports more capabilities than most. McKinley strives to add value to the raw indexing by their content reviews and ratings; in this sense, they are more like magazine publishers than any of the other engines, which are "value-free".

For each Internet resource, a descriptive entry is presented that includes the review, the rating, key words, the intended audience, and cross-links to related categories.

There are three indexes the user can use to search for information: a key word index, an audience field index, and a category cross-linking index built from the cross-links indicated in each resource entry. Queries are quite rich, allowing proximity-based queries, wildcarding, and the use of related terms, as well as the usual capabilities. The query engine ranks sites according to "relevancy". Magellan will soon be available in French, German, and Japanese.

Magellan uses full-text retrieval by PLS and Internet access by Netcom, and runs on Silicon Graphics workstations and servers. For a general overview, see "ComputerLife", July 7, 1995. See also: the Magellan FAQ and http://magellan.mckinley.com/mckinley-txt/ypbookad.html

Open Text Index

Open Text Index uses the Extreme Index retrieval engine. It indexes European and non-European languages such as Japanese and Arabic, can perform queries on multiple servers simultaneously, automatically searches Web sites to build and update its index, and can index over 40 different types of files, including word processor, SGML, HTML, and PDF formats.

TradeWave Galaxy

Not reviewed.

WebCrawler

WebCrawler presents an interface much like that of Lycos, allowing keyword searches of WWW pages from a (mostly) automatically maintained catalog database. WebCrawler presents a flat search space rather than a hierarchy; thus, to search, you do not need to figure out how a page might have been categorized. Queries consist of keywords that are used to determine a numerical rating for how well each indexed document matches the query. Unlike Lycos, the only modifications that can be made to the use of keywords are ANDing, ORing, and negation of terms. Here also, negation serves to decrease the rating of a document prominently containing the term, rather than eliminating it as in a Boolean query. In addition to queries against the precomputed index, it is possible to issue real-time queries that actually perform WWW searches; however, these are not available to the general public, since they would quickly swamp both WebCrawler and the Internet.

WebCrawler locates documents from a collection of known roots and retrieves documents breadth-first to the indexer, which extracts document titles, terms, and hit frequencies. Breadth-first search is used in preference to depth-first to avoid swamping individual servers. Within this general approach, searches can be random or directed toward new documents that appear to match particular keywords of interest. Hit frequencies are used in rating documents and can be used in subsequent indexing cycles. One of the design objectives of WebCrawler was to support alternate search strategies (for crawling for documents, not, apparently, for querying the index), so presumably WebCrawler's internal architecture is fairly open. WebCrawler is written in C and Objective-C for NEXTSTEP, and uses a modified version of the WWW library from CERN. See WebCrawler Facts.

WebCrawler is now run by America Online; it is freely available as an Internet search tool.

See also: Second International WWW Conference '94 for an overview of the Web Crawler.

Who's Who on the Internet

A directory of Internet personalities. Not reviewed.

Yahoo

Yahoo features a hierarchically organized subject tree and was one of the first searching tools on the Internet. Users can either traverse the subject index via links, or search against the index using a simple keyword query capability. Queries can be restricted to sub-parts of the index, titles, Uniform Resource Locator (URL) addresses or comments. Boolean searches can also be performed. Search results are returned along with their locations within Yahoo's hierarchical index. A news summary page allows users to quickly scan current news and identify stories of interest. Complete articles can be retrieved by simply clicking on related story summaries. Yahoo! also highlights Internet sites that contain information about current events; Reuters News Service provides a news feed.

Entries in the Yahoo index are created manually by a reviewer. This is unlike other systems such as Lycos and Alta Vista where internal indices are created automatically and searched via queries. Yahoo reviewers find pages to index in two ways: by user submission from a button on the Yahoo page, and via a search engine that crawls web pages and returns documents. Most entries are currently collected by submission.

Yahoo is implemented mostly in Perl, and is built almost entirely from public domain software available on the Internet.

See also: http://www.yahoo.com/docs/pr/credits.html


This research is sponsored by the Defense Advanced Research Projects Agency and managed by the U.S. Army Research Laboratory under contract DAAL01-95-C-0112. The views and conclusions contained in this document are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied of the Defense Advanced Research Projects Agency, U.S. Army Research Laboratory, or the United States Government.

© Copyright 1996 Object Services and Consulting, Inc. Permission is granted to copy this document provided this copyright statement is retained in all copies. Disclaimer: OBJS does not warrant the accuracy or completeness of the information on this page.

This page was written by David Wells, with contributions by Ansu Kurien. Send questions and comments about this page to wells@objs.com.

Last updated: 04/22/96 7:45 PM
