Searching and Indexing


Table of Contents

Introduction

Design Space

Systems


Introduction

To locate information in the WWW by any means other than naive hyperlink traversal requires the use of an index or some sort of query capability. These two topics are closely related, since an index can be thought of as a cached query, and indeed, many indices are created by using web crawlers that have much in common with query engines (e.g., similar traversal strategies and building of temporary indices). Further, some indices allow queries to be applied to their contents, while others simply are organized collections of hyperlinks.

Regardless of style, locating information in a document space (in this case the WWW) ultimately partitions the document space into two parts: documents retrieved and documents not retrieved. Each of these parts may contain both relevant and irrelevant documents and all shades of partial relevance in between. Precision measures how well the retrieved documents meet the needs of the user, i.e., how few irrelevant documents were retrieved. Recall measures how many of the relevant documents were actually retrieved. To improve precision, retrieved documents are often ranked by a presumed measure of relevance. Accurately measuring precision and recall requires omniscient knowledge of all document contents and of the user's actual (not necessarily stated) needs; needless to say, actual systems can only estimate them. The aim of any information retrieval system (web-based or otherwise) is to maximize precision and recall at reasonable cost, while minimizing complexity for the user.
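
To make the measures concrete, here is a minimal sketch (not from the survey itself) of how precision and recall would be computed if the sets of relevant and retrieved documents were somehow known:

    def precision_recall(retrieved, relevant):
        """retrieved, relevant: sets of document identifiers."""
        hits = retrieved & relevant                      # relevant documents actually retrieved
        precision = len(hits) / len(retrieved) if retrieved else 0.0
        recall = len(hits) / len(relevant) if relevant else 0.0
        return precision, recall

    # Example: 3 of 4 retrieved documents are relevant (precision 0.75),
    # but only 3 of 6 relevant documents were found (recall 0.5).
    print(precision_recall({"d1", "d2", "d3", "d4"},
                           {"d1", "d2", "d3", "d5", "d6", "d7"}))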

These goals are, of course, mutually contradictory: broadening a search to improve recall generally admits more irrelevant documents and lowers precision, and improving both at once raises cost and user-visible complexity.

As such, systems generally tend to emphasize some goals more than others. As a trivial example, the list of references for a journal article (hopefully) has high precision for relevant related articles because it is hand constructed by an expert, will naturally not include many other relevant pieces of information (low recall), is expensive to build (for its size) because a human is involved in creating it, and has low complexity. Other kinds of mechanisms fall elsewhere in the design space.

For a general overview of indexing issues, see InfoSeek: Searching the Web. For a nice glossary of WWW terms, see NCSA Glossary.


Design Space

Tools for locating information can be classified by several different criteria:

Information Space & Query Models

This refers to how the tool and the user model and manipulate the information space. There are three considerations: how the tool presents information to the user, how information is represented inside the tool, and whether these two views are the same.

Tools are generally acted upon by the user either as an index to be browsed or as an interface through which queries (not necessarily SQL-like) may be formed and submitted. Tools such as Yahoo that present as an index invariably maintain internal indices. If a query interface is presented, an internal index may or may not be maintained; however, tools that manage large systems almost always create indices for internal use to reduce the cost of query evaluation. The internal and external views are not necessarily the same: Yahoo both stores and (primarily) presents as indices, while Harvest and HyPursuit construct and maintain indices for efficiency but present a query interface to the user.

An interesting alternative to subject-based searches is the "social" search [Erickson, CACM, 1/96, pg. 15], in which searching is based on profiles (stored or known) of "who would be likely to know about X" that are used as starting points for traversals. This is mechanically very similar to the use of subject indices, the difference being that non-documents or untraditional documents (personal pages) are used instead of subject indices.

How Information Becomes Organized

Querying or browsing implies that at some point the information space gets partitioned. This can be done in several different ways.

The information space may be organized manually by humans; this corresponds to edited collections and is generally confined to relatively small, cohesive collections within a larger information space. An interesting point made at a recent DTII meeting is to not underestimate the value of editors as a form of "information retailer" that repackages "wholesale" information available in digital libraries or on the WWW.

Alternatively, the information space may be automatically clustered using a variety of techniques. Most common are clustering by keywords, by term occurrences in the text, and by hyperlink graph structure; combinations of these criteria are sometimes used. The clusters may then be indexed based on some function of the clustering criteria (e.g., most common terms).
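
As a purely hypothetical illustration of term-based clustering (it does not describe any particular engine), documents might be grouped by the overlap of their most frequent terms, with each cluster indexed by the terms it accumulates:

    from collections import Counter

    def top_terms(text, k=10):
        """The k most frequent words, a crude stand-in for the document's key terms."""
        return {t for t, _ in Counter(text.lower().split()).most_common(k)}

    def cluster(docs, threshold=0.3):
        """docs: dict mapping document name -> text.  Returns (index terms, members) pairs."""
        clusters = []                                    # each cluster is [term set, member names]
        for name, text in docs.items():
            terms = top_terms(text)
            for cterms, members in clusters:
                overlap = len(terms & cterms) / len(terms | cterms)
                if overlap >= threshold:                 # similar enough: join the cluster
                    members.append(name)
                    cterms.update(terms)                 # the cluster's index terms grow with it
                    break
            else:                                        # no similar cluster: start a new one
                clusters.append([terms, [name]])
        return [(sorted(cterms), members) for cterms, members in clusters]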

Finally, there are hybrid approaches that use web crawlers to find documents fitting certain coarse groupings as the basis for an index that is then refined by a human. For example, a crawler could locate and sort documents that contained any of a collection of medical terms; the (presumably relevant) documents would then be reviewed to determine if and how they should fit into a medical index. Such indices can be quite large (e.g., Yahoo), but are labor-intensive and take longer to populate/maintain than fully automatic indices. On the other hand, they are more likely to have high precision and are capable (in principle) of capturing nuances in the documents that cannot be detected automatically with current technology.

How the Indexer Finds Documents to Index

Regardless of whether documents are indexed by a human or automatically, the documents must first be found somehow. For large indices, this is the job of a web robot (agent). The behavior of the robot has an enormous impact on the coverage of the resultant index, its currency, and whether the system is perceived as a good "Internet citizen".

Robots need to be designed with several main design principles in mind. Generally, web robots start with a collection of known documents and traverse links from these documents to find servers with new documents to index. These documents are then retrieved to the indexer for processing. The main concerns in web robot design are not to swamp network bandwidth, not to swamp individual servers, and to respect the privacy of servers. Bandwidth is conserved in several ways: not retrieving documents multiple times, retrieving only indexable document types (e.g., don't retrieve .gif files if you don't index pictures), retrieving during off-hours, and minimizing the distance documents must travel (distributed indexers do this by processing at or near the server). Swamping a server is avoided by breadth-first searches (so as not to retrieve all documents from a single server in sequence), by not retrieving a document multiple times, and by re-indexing on an appropriate schedule that does the most popular (and therefore probably most capable) sites first. The privacy of a server is respected by allowing the server to place itself (partially) off-limits to the robot and by providing a feedback channel.
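
A minimal sketch of these politeness rules follows. It assumes a hypothetical helper, fetch_and_extract_links(), that retrieves a page and returns the URLs it links to; real robots are considerably more elaborate.

    import time
    from collections import deque
    from urllib.parse import urlparse, urljoin
    from urllib.robotparser import RobotFileParser

    INDEXABLE = (".html", ".htm", ".txt", "/")           # skip images etc. that we cannot index

    def crawl(roots, fetch_and_extract_links, agent="surveybot", delay=5.0):
        queue, seen, robots, last_hit = deque(roots), set(roots), {}, {}
        while queue:
            url = queue.popleft()                        # breadth-first: spreads load across servers
            host = urlparse(url).netloc
            rp = robots.get(host)
            if rp is None:                               # honor the server's robots exclusion file
                rp = RobotFileParser(f"http://{host}/robots.txt")
                try:
                    rp.read()
                except OSError:
                    pass                                 # unreadable robots.txt: can_fetch() then skips the host
                robots[host] = rp
            if not rp.can_fetch(agent, url):
                continue
            wait = delay - (time.time() - last_hit.get(host, 0))
            if wait > 0:
                time.sleep(wait)                         # do not swamp a single server
            last_hit[host] = time.time()
            for link in fetch_and_extract_links(url):
                link = urljoin(url, link)
                if link not in seen and link.lower().endswith(INDEXABLE):
                    seen.add(link)                       # never retrieve the same document twice
                    queue.append(link)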

Views

Views are static, periodically re-evaluated queries that serve to provide contexts against which more specific queries can be posed. For example, a view could be used to create an index of only "computer science" documents, which could then be searched for "channels". Views are usually created at servers (i.e., users don't create them). Views help to disambiguate vocabulary when terms have different meanings in different contexts. Thus, in addition to speeding up retrieval, the use of specialized views tends to make search results more precise.
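
A toy sketch of the idea, assuming an in-memory index that maps documents to term sets (it is not any particular system's implementation):

    class View:
        """A stored, periodically re-evaluated query; later queries run against its result."""

        def __init__(self, index, predicate):
            self.index, self.predicate = index, predicate    # index: dict of doc -> set of terms
            self.refresh()

        def refresh(self):
            """Re-evaluate the view as the underlying index changes (e.g., on a schedule)."""
            self.docs = {d: t for d, t in self.index.items() if self.predicate(t)}

        def search(self, term):
            """Search only within the (smaller, more homogeneous) view."""
            return [d for d, t in self.docs.items() if term in t]

    # e.g., restrict searches for the ambiguous term "channels" to computer science documents:
    # cs_view = View(index, lambda terms: "computer science" in terms)
    # cs_view.search("channels")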

Query Refinement & Querying Queries

A problem common to all information retrieval systems, whether Web-based or not, is phrasing a query that returns the right amount of the desired information. This means both getting the query predicate and search target correct and getting the right precision and recall given the user's desire at that time. While a user certainly can keep submitting new queries until a more or less "right" result is returned, the more usual way of doing things is to have the search engine provide some mechanisms for tuning the query or re-querying previously obtained results. This has benefits in that it is easier for the user to modify an existing query than to create a new one, and because it possibly reduces the computation required to evaluate subsequent queries.

We first examine several techniques to present query refinement to the user, and then examine how these are supported inside the search engine and look at the potential performance implications of query refinement.

Presentation

Several different approaches have been taken to query improvement in WWW search engines. A given engine normally supports more than one of these approaches, since in general they have different objectives.

Internals and Performance

It is often cheaper for a query engine to process a refined query than to start over with a presumably new query. This is one advantage to supporting query modification rather than letting the user do it on an ad hoc basis.
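
A hypothetical sketch of why refinement can be cheaper: a narrowed query can be evaluated against the cached result set of the previous query rather than against the whole index.

    def run(index, query_terms):
        """Full evaluation: every document in the index is examined."""
        return {doc for doc, doc_terms in index.items() if query_terms <= doc_terms}

    def refine(previous_results, index, extra_terms):
        """Refinement: only the documents retrieved by the previous query are re-examined."""
        return {doc for doc in previous_results if extra_terms <= index[doc]}

    # first = run(index, {"search"})               # scans the whole index
    # better = refine(first, index, {"engine"})    # touches only the earlier result set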

Centralized vs. Distributed

Indices can be maintained and searches performed either at a centralized location by a single engine, or at distributed locations by multiple engines. In either case, indices and engines can be replicated to provide greater throughput; replication does not change the basic premise that in one case all information is available at a single site whereas in the other it is not. In centralized engines, indices are built by doing a (generally breadth-first) retrieval of documents reached from some collection of initial root documents. Retrieved documents are brought to the server, where summary information (terms and links) is extracted. Needless to say, this is expensive in terms of bandwidth. Links extracted from documents are used for further searches; pruning may occur to prevent documents from being retrieved multiple times or to trim links that emanate from "uninteresting" documents.

The general strategy in a distributed engine is that leaf servers index (or search) documents as if each were a single, centralized server. Summaries of these indices are reported up to higher level indices until a top level index is reached. The contents of the summary and the way in which it is determined are an important point of variation among distributed engines. The major variants are to report all terms sufficiently important to be included in the local index, or to cluster the indexed documents and report the cluster summaries or cluster centroids (HyPursuit). In the latter case, a cluster is treated by the higher level server as a "document" to be indexed or clustered in turn, making the whole process recursive. Note that distributed engines are not necessarily hierarchical, since there is no inherent need for a strict partitioning of the indices; in fact, DAGs are a more common configuration.

Distributed engines all have some mechanism for routing queries or index probes to the "correct" lower level server. The index summaries are used to decide which lower level server may contain information relevant to a query. Pruning away contacts with lower level servers that are likely to be irrelevant to the query at hand is important for performance, and the strategies by which this is done differentiate between servers.
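
The following sketch illustrates the general idea of summary-based routing under simplifying assumptions (term-set summaries, exact term matching); it is not the protocol of any specific engine.

    from collections import Counter

    def summarize(local_index, keep=100):
        """A leaf server reports upward only the terms important enough for its local index."""
        counts = Counter(t for terms in local_index.values() for t in terms)
        return {t for t, _ in counts.most_common(keep)}

    def route(query_terms, children):
        """children: dict of server name -> reported term summary.
        Forward the query only to servers whose summary could possibly match."""
        return [name for name, summary in children.items() if query_terms & summary]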

Distributed engines such as Harvest and HyPursuit have a number of advantages over centralized engines: individually and collectively, these advantages mean that distributed engines scale much better than centralized engines, while at the same time better preserving local autonomy. Interestingly, although the principal advantage of distributed systems is their scalability, in practice centralized indices still index document collections that are 10-100 times larger.

Completeness and Currency

Unlike traditional databases, which comprise a closed world in which a null answer means the answer is "no", the WWW comprises an open world in which a null answer may mean only that the object could not be reached. As a result, all searches in the WWW should be treated as incomplete. One way to evaluate a search mechanism is by how close it comes to complete coverage of the information space. This is different from whether the index supports sophisticated searching techniques that allow users to phrase the right query for what they really want; the issue here is whether the search engine can reach a particular document at all, and whether, as part of its internal categorization process, it abstracts away information in the interest of efficiency that subsequently makes a document unreachable or a term unusable. A document can be unreachable for several reasons: the robot may never encounter a link to it, its server may be off-limits or unavailable, or the summarization process may have discarded the terms needed to find it.

Note that private document spaces such as corporate document archives can be closed in this sense.

Openness and Interoperability

Ideally, search engines should be able to make use of indices, clusterings, and documents retrieved by other search engines. Likewise, a query engine should be able to use multiple catalogs in processing a query. Distributed systems appear to be more open to this; Harvest and HyPursuit both allow a higher level indexer to use content summaries created by any indexing mechanism, as long as the summary is provided in a standard format. Unfortunately, the two systems seem to use different summary formats. The Stanford Digital Libraries project is building a common interface to several query engines. Interoperability work is all in its infancy.


Systems

The following indexers and search engines were found.

Alta Vista

This project was created at Digital Equipment Corporation and offers compact or detailed searches through a Web Index. DEC claims that this Web Index is the largest in existence. It also provides a full-text index of more than 13,000 newsgroups.

Deja News Research Service

This is a Usenet news searching service which provides access to what it claims is the world's largest Usenet news archive. Several options are available so that the user can tailor the search; for example, a "create a query filter" option can narrow the search to a particular newsgroup, date, or author.

Excite

Excite by Architext searches a database of Web pages and claims to be smarter than Yahoo. The company claims that the database contains more than 1.5 million web documents and 40,000-plus Web sites. It also allows the user to browse through the past two weeks of Usenet news and classified ads. In addition it provides NetReviews - reviews of Internet resources and web sites.

Harvest

Harvest is a hierarchical, and to some extent, open search engine. There are two main objectives to Harvest: to radically reduce CPU and network bandwidth costs associated with creating and searching indices; and to support a variety of indexing strategies. Harvest does this via a hierarchical organization, moving computation close to the data servers, replication, caching, and compression. While important for performance, caching, replication, and compression seem fairly standard (although not for indexers).

"Gatherers" at data servers collect information about the content of the individual data servers. This allows individual document accesses to be local calls rather than network retrievals of entire documents, as is customary in centralized engines. Summary information is compressed before shipment to higher level engines. In combination, this makes transmission of the summary information for a data server cheaper than the typical cost of retrieving a single document. Measured results indicate a better than 6,000:1 reduction in combined bandwidth, CPU, and disk requirements for indexing a "typical" Archie site. Gatherers can index remote data if a site will not allow a Gatherer to be placed locally, but then the performance improvements are much smaller. Index summaries are much smaller than the full-text indices used in, e.g., WAIS, but perform nearly as well; the claim is nearly the same precision, 70% of the recall, and only 3-11% of the index size. The option of full-text indexing exists if needed. The results of gathering can be shared by multiple higher level indexers called Brokers, which means that a site need only be indexed once in support of potentially many higher level indices.

Index creation can be customized. Examples given include hand-built indices and indexers that unpack nested structures such as compressed tars to extract function names from program source. Two specialized indexers are supported: Glimpse, which supports content queries, Boolean expressions, regular expressions, and approximate matching while requiring only small indices; and Nebula, which supports views. Views are important to the Harvest search strategy because they make possible hierarchical classification schemes in which Brokers index particular topic areas. This can be used for browsing-style search a la Yahoo, or to constrain searches when performing queries.

A number of document types have been indexed; see Bowman et al., Harvest: A Scalable, Customizable, Discovery and Access System. Harvest provides an HTML forms-based interface that looks much like that of other search engines.

HyPursuit

HyPursuit is a hierarchical search engine developed at MIT. As of 1/96, a collection of 42 servers indexed 100 leaf WWW sites. The principal features of HyPursuit are space-efficient indexing based on adjustable document clusters created by a bottom-up clustering technique that uses both document terms and hyperlinks, pruning of the search space, and feedback mechanisms for query improvement.

Leaf indexers create document clusters for the documents they can access; these documents are relatively local, to minimize network traffic. Clustering is based on a "document similarity function" that uses both document terms and hyperlinks; the similarity function is exposed and can be changed. Both terms and hyperlinks are used because both carry information content, which purportedly makes the resulting automatically generated clusters closer to human-generated clusters (which rely on document semantics) than clusters produced by other techniques. An experiment clustering CNN pages using the HyPursuit hybrid algorithm bears this out. The number of clusters at a server is adjusted by specifying to HyPursuit the amount of storage space to be allocated to indices and the desired cluster cohesion. Higher level servers cluster in the same way, treating cluster centroids reported from lower level servers as "documents" to be clustered. At each level, less relevant terms in the summaries are dropped to match resource availability.
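
HyPursuit's actual similarity function is not reproduced here; the sketch below only illustrates the general idea of weighting term overlap against hyperlink overlap, with the weight alpha and the Jaccard measure chosen arbitrarily for illustration.

    from collections import namedtuple

    Doc = namedtuple("Doc", ["terms", "links"])      # terms and links are sets

    def jaccard(x, y):
        return len(x & y) / len(x | y) if (x or y) else 0.0

    def similarity(a, b, alpha=0.5):
        """Weighted combination of term overlap and hyperlink-neighborhood overlap."""
        return alpha * jaccard(a.terms, b.terms) + (1 - alpha) * jaccard(a.links, b.links)

    # Documents that cite the same pages but share few words can still come out similar:
    # similarity(Doc({"markets"}, {"u1", "u2"}), Doc({"stocks"}, {"u1", "u2"}))  ->  0.5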

Searches can begin at any level in the server hierarchy. A search will identify a collection of potentially relevant clusters. The search will then be directed at the servers that manage documents or sub-clusters of the identified clusters. This proceeds until leaf servers are reached, at which point documents are identified. At each level, more information about the clusters becomes known. As a result, pruning is less aggressive at the top levels, since it is not clear at that point that a cluster is not relevant.

HyPursuit uses a number of clever techniques to refine queries, including suggesting terms that broaden or narrow a query or that are collocated with (in the same cluster as) the query terms. In addition, summaries (called "content labels") can be browsed.

Infoseek Guide

Not reviewed.

Infoseek Professional

Not reviewed.

Lycos

Lycos supports keyword searching for WWW pages from a (mostly) automatically maintained catalog database. Unlike Yahoo, Lycos presents a flat search space rather than a hierarchy; thus, to search, you do not need to figure out how a page might have been categorized. Queries consist of keywords that are used to determine a numerical rating for how well each indexed document matches the query. The way keywords are used in computing the rating can be modified using a number of pull-down menus, which allow ANDing, ORing, m-of-n matching, and setting the tightness of match desired. Arbitrary Boolean queries are not supported because searches are posed against the index rather than the documents, so negation is impossible to compute efficiently. However, an individual keyword can be negated by preceding it with a "-", which causes pages containing that word to be given a lower rating.
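
A hypothetical scoring sketch (not Lycos's actual formula) shows how a negated keyword can lower a document's rating without eliminating the document outright:

    def rate(doc_terms, query):
        """doc_terms: set of words on the page; query: space-separated keywords, '-' negates."""
        score = 0.0
        for kw in query.split():
            if kw.startswith("-"):
                score -= 1.0 if kw[1:] in doc_terms else 0.0     # lowers, but does not exclude
            else:
                score += 1.0 if kw in doc_terms else 0.0
        return score

    # rate({"search", "engine", "spam"}, "search engine -spam")  ->  1.0
    # The page is still returned, just ranked below pages that do not contain "spam".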

Lycos manages an index database that the company claims has millions of link descriptors and documents, built by a Web crawler that brings in thousands of documents daily. The indices are created and maintained using the SCOUT indexer. Lycos keeps a finite model of the web in which each server is the root of a tree of files. Root servers are located by submission from users via a menu on the Lycos page. From each server, a crawler retrieves the pages owned by that server. From each retrieved page, summary information is collected, including: title, headers, the 100 most important words, the first 20 lines, the size of the document in bytes, the number of words, and the number of references to the page from other pages. Incoming references are used by a pseudo-random search heuristic in which the spiders look at the most popular sites first, so that catalog entries for them are most current. In addition to document information, Lycos creates summary information about sets of documents (from a server) that is used in the search process. Server owners can place parts of a server's pages off-limits to Lycos by entries in a standard file that the crawler checks before retrieving pages. Lycos searches http, ftp, and gopher sites; it ignores telnet, mailto, and WAIS. Lycos is written in Perl, but uses a C program based on CERN's libwww to fetch URLs.
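
A rough sketch of the kind of per-page summary described above, using word frequency as a crude stand-in for "importance" (the survey does not say how SCOUT actually ranks words):

    import re
    from collections import Counter

    def summarize_page(html, url):
        """Extract the kinds of fields listed above from a fetched page."""
        text = re.sub(r"<[^>]+>", " ", html)                          # crude tag stripping
        words = [w.lower() for w in text.split()]
        title = re.search(r"<title>(.*?)</title>", html, re.I | re.S)
        headers = re.findall(r"<h[1-6][^>]*>(.*?)</h[1-6]>", html, re.I | re.S)
        return {
            "url": url,
            "title": title.group(1).strip() if title else "",
            "headers": [h.strip() for h in headers],
            "important_words": [w for w, _ in Counter(words).most_common(100)],
            "first_lines": text.splitlines()[:20],
            "size_bytes": len(html.encode("utf-8")),
            "word_count": len(words),
        }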

See: http://www.lycos.com/reference/slides.html and http://www.lycos.com/reference/faq.html

Magellan - McKinley's Internet Directory

Magellan is an online directory of reviewed and rated Internet sites. The most important features of Magellan are that articles are reviewed and rated by McKinley, and that its query interface supports more capabilities than most. McKinley strives to add value to the raw indexing by their content reviews and ratings; in this sense, they are more like magazine publishers than any of the other engines, which are "value-free".

For each Internet resource, a descriptive entry is presented that includes the review, the rating, key words, the intended audience, and cross-links to related categories.

There are three indexes the user can use to search for information: a key word index, an audience field index, and a category cross-linking index built from the cross-links indicated in each resource entry. Queries are quite rich, allowing proximity-based queries, wildcarding, and the use of related terms, as well as the usual capabilities. The query engine ranks sites according to "relevancy". Magellan will soon be available in French, German, and Japanese.

Magellan uses full-text retrieval by PLS and Internet access by Netcom, and runs on Silicon Graphics workstations and servers. For a general overview, see "ComputerLife", July 7, 1995. See also: the Magellan FAQ and http://magellan.mckinley.com/mckinley-txt/ypbookad.html

Open Text Index

Open Text Index uses the Extreme Index retrieval engine. It indexes European and non-European languages such as Japanese and Arabic, can perform queries on multiple servers simultaneously, automatically searches Web sites to build and update its index, and can index over 40 different types of files, including word processor, SGML, HTML, and PDF formats.

TradeWave Galaxy

Not reviewed.

WebCrawler

WebCrawler presents an interface much like that of Lycos, allowing keyword searches of WWW pages from a (mostly) automatically maintained catalog database. WebCrawler presents a flat search space rather than a hierarchy; thus, to search, you do not need to figure out how a page might have been categorized. Queries consist of keywords that are used to determine a numerical rating for how well each indexed document matches the query. Unlike Lycos, the only modifications that can be made to the use of keywords are ANDing, ORing, and negation of terms. Here also, negation serves to decrease the rating of a document prominently containing the term, rather than eliminating it as in a Boolean query. In addition to queries against the precomputed index, it is possible to issue real-time queries that actually perform WWW searches; however, these are not available to the general public, since they would quickly swamp both WebCrawler and the Internet.

WebCrawler locates documents from a collection of known roots and retrieves documents breadth-first to the indexer, which extracts document titles, terms, and hit frequencies. Breadth-first search is used in preference to depth-first to avoid swamping individual servers. Within this general approach, searches can be random or directed toward new documents that appear to match particular keywords of interest. Hit frequencies are used in rating documents and can be used in subsequent indexing cycles. One of the design objectives of WebCrawler was to support alternate search strategies (for crawling for documents, not, apparently, for querying the index), so presumably WebCrawler's internal architecture is fairly open. WebCrawler is written in C and Objective-C for NEXTSTEP, and uses a modified version of the WWW library from CERN. See WebCrawler Facts.

WebCrawler is now run by America Online; it is freely available as an Internet search tool.

See also: Second International WWW Conference '94 for an overview of the Web Crawler.

Who's Who on the Internet

A directory of Internet personalities. Not reviewed.

Yahoo

Yahoo features a hierarchically organized subject tree and was one of the first searching tools on the Internet. Users can either traverse the subject index via links, or search against the index using a simple keyword query capability. Queries can be restricted to sub-parts of the index, titles, Uniform Resource Locator (URL) addresses or comments. Boolean searches can also be performed. Search results are returned along with their locations within Yahoo's hierarchical index. A news summary page allows users to quickly scan current news and identify stories of interest. Complete articles can be retrieved by simply clicking on related story summaries. Yahoo! also highlights Internet sites that contain information about current events; Reuters News Service provides a news feed.

Entries in the Yahoo index are created manually by a reviewer. This is unlike other systems such as Lycos and Alta Vista where internal indices are created automatically and searched via queries. Yahoo reviewers find pages to index in two ways: by user submission from a button on the Yahoo page, and via a search engine that crawls web pages and returns documents. Most entries are currently collected by submission.

Yahoo is implemented mostly in Perl, and is built almost entirely from public domain software available on the Internet.

See also: http://www.yahoo.com/docs/pr/credits.html


This research is sponsored by the Defense Advanced Research Projects Agency and managed by the U.S. Army Research Laboratory under contract DAAL01-95-C-0112. The views and conclusions contained in this document are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied of the Defense Advanced Research Projects Agency, U.S. Army Research Laboratory, or the United States Government.

© Copyright 1996 Object Services and Consulting, Inc. Permission is granted to copy this document provided this copyright statement is retained in all copies. Disclaimer: OBJS does not warrant the accuracy or completeness of the information on this page.

This page was written by David Wells, with contributions by Ansu Kurien. Send questions and comments about this page to wells@objs.com.

Last updated: 04/22/96 7:45 PM
