FROM THE WEB TO THE GLOBAL INFOBASE

Hector Garcia-Molina (PI), Chris Manning (co-PI), Jeff Ullman (co-PI), Jennifer Widom (co-PI)
Department of Computer Science
Stanford University

Contact Information

Hector Garcia-Molina, Chris Manning, Jeff Ullman, Jennifer Widom
Dept. of Computer Science
Gates Hall 4A
Stanford University
Stanford, CA 94305
Phone: (650) 723-0872
Fax: (650) 725-2588
Email: {hector,manning,ullman,widom}@cs.stanford.edu
URLs:
http://www-db.stanford.edu/people/hector.html
http://www-nlp.stanford.edu/~manning
http://www-db.stanford.edu/~ullman
http://www-db.stanford.edu/~widom

WWW PAGE

None yet

Project Award Information

Keywords

World-Wide Web, information retrieval, database, natural-language processing, data mining

Project Summary

Our proposed work is driven by the vision of a Global InfoBase (GIB): a ubiquitous and universal information resource, simple to use, up to date, and comprehensive. Towards this vision, we will develop technologies needed to transform today's World-Wide Web into the GIB. The project consists of four interrelated thrusts:

  1. Combining Technologies: We will integrate existing technologies for information retrieval, database management, and hypertext navigation, to achieve a "universal" information model and query language. We will begin our research by transferring operators from one type of system (e.g., proximity search from information retrieval) into other systems (e.g., relational databases). Ultimately, we will develop operators and information representations appropriate for all information, whether it is structured, unstructured, hyperlinked, or multimedia.
  2. Personalization: We will develop tools for personalizing information management, so that users obtain more relevant and timely information. We will investigate new mechanisms for extracting user preferences from access histories, home pages, bookmarks, and other sources. We will also implement new search and browsing algorithms that bias results in favor of the user's interests.
  3. Semantics: Using natural-language processing, we will develop sophisticated tools for analyzing the semantics of Web pages and their interrelationships. We will investigate how to apply such tools efficiently over large volumes of information, and we will develop technology for tracking the lineage or provenance of Web pages.
  4. Data Mining: We will design new algorithms for mining information on the Web in order to synthesize new knowledge. We will start by developing algorithms for learning facts from the Web, based on an initial set of sample facts. We will also develop new clustering algorithms that scale to the Web.

Publications and Products

The Global InfoBase project is only a few months old. The following list represents recent publications by the PI and co-PIs related to the GIB.

Project Impact

The project is funding several graduate students, with the expectation that their work on the project will lead to Ph.D. theses. We expect to involve several undergraduate students during the lifetime of the project, for course units or funding. We will support one or two undergraduates to perform research within the project during summer 2001 through Stanford's new CURIS (Computer Science Undergraduate Research Internship) program. We are currently collaborating with the IBM Almaden Research Center, and we expect other industrial collaborations to develop as the project matures.

Goals, Objectives, and Targeted Activities

Current and near-term future work is described here in terms of the four thrusts specified in the Project Summary above. In addition, a significant ongoing effort will be continuous integration of the work in the four thrusts into a cohesive whole -- all four technologies must work together to create an effective Global InfoBase.

  1. Combining Technologies: Our current focus is on the problem of computing and exploiting "similarity" measures in the context of the GIB. Similarity plays an important role in a universal information resource. For example, users may have identified a set of Web pages or other data items of interest, and they may wish to find other pages or items that are "most similar." Or users may wish to cluster sets of pages or items based on their similarity, for browsing purposes. Although the similarity problem has been studied before, for example in the context of text documents, we are looking at it in two new ways, in both cases exploiting structure to obtain more accurate similarity results. In one area, we are investigating ways to compute the similarity of sets of items when there is a superimposed hierarchy on the items, e.g., groceries purchased or college classes taken. The hierarchy allows us to compute much more refined similarity measures than traditional methods based on set intersection. In another area, we are investigating the use of connectivity among Web pages or data items to determine similarity. The basic idea is that two items are similar if they are connected to other items that are similar. Although the definition is recursive, it can be evaluated using eigenvectors, again producing what we believe to be more refined similarity results. In both areas we are developing and implementing a variety of algorithms, and we plan to perform scalability and sensitivity experiments, as well as user studies, to show their effectiveness compared with simpler methods.
  2. Personalization: We are currently exploring how to adapt search and ranking mechanisms to take user preferences into account. For example, the PageRank algorithm used by the Google search engine (www.google.com) can be modified to give bookmarked pages additional weight in the recursive PageRank evaluation. We expect that our work on similarity, described above, will provide additional tools for personalization. For example, based on similarity measures we can automatically provide users with pages or items of potential interest, and we can perform sophisticated collaborative filtering.
  3. Semantics: We are currently attacking the problem of how to get from syntactic analyses of sentences to semantic representations that can be used for information integration over Web pages. A particular focus is the semantic identity problem: determining when references to "C. D. Manning", "Christopher Manning", or just "Chris" refer to the same person, and when they do not. Natural-language processing techniques are especially promising for this problem, because its solution requires context-specific probabilistic decision making that also draws on appropriate prior knowledge. The result is the ability to create database records on the fly, holding information such as a person's job title, contact details, and responsibilities. This work extends information extraction and wrapper generation work in that the relation(s) to be captured are not fixed in advance.
  4. Data Mining: Our current focus is on automating the subjective evaluation of strategies for clustering Web documents by topic. The alternatives tried include mixtures of words contained in the documents, words in the "anchor text" leading to the document, and words surrounding the anchor. To turn this subjective judgment into something that can be measured without human intervention, we evaluate different measures of document similarity on the assumption that the Open Directory (dmoz.org) represents ground truth. There, document authors have subjectively classified their documents in a way that lets us estimate the similarity of two documents by how close their lowest common ancestor in the hierarchy is. The conclusion of the study is that the best of the approaches tried are those that look at text surrounding the anchor, weighted by closeness to the anchor.
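
As an illustration of the connectivity-based similarity idea in thrust 1 ("two items are similar if they are connected to other items that are similar"), the following is a minimal sketch only. It uses a plain fixed-point iteration rather than the eigenvector formulation mentioned above, and the function name, the decay constant, and the iteration count are all assumptions made for this example, not details of the project's algorithms.

```python
from itertools import product


def link_similarity(in_links, decay=0.8, iterations=10):
    """Iteratively estimate pairwise similarity from link structure.

    in_links maps every node to the list of nodes that link to it.
    Every node that appears in a list must also appear as a key.
    decay and iterations are illustrative assumptions.
    """
    nodes = list(in_links)
    # Start from the trivial guess: a node is similar only to itself.
    sim = {(a, b): 1.0 if a == b else 0.0
           for a, b in product(nodes, repeat=2)}
    for _ in range(iterations):
        new_sim = {}
        for a, b in product(nodes, repeat=2):
            if a == b:
                new_sim[(a, b)] = 1.0
                continue
            ia, ib = in_links[a], in_links[b]
            if not ia or not ib:
                # No incoming links: no evidence of similarity.
                new_sim[(a, b)] = 0.0
                continue
            # Average the similarity of all pairs of in-neighbors.
            total = sum(sim[(x, y)] for x in ia for y in ib)
            new_sim[(a, b)] = decay * total / (len(ia) * len(ib))
        sim = new_sim
    return sim


# Hypothetical toy graph: "hub" links to both "a" and "b"; "c" is isolated.
toy_in_links = {"hub": [], "a": ["hub"], "b": ["hub"], "c": []}
toy_sim = link_similarity(toy_in_links)
```

In this toy graph, "a" and "b" share their only in-neighbor and so receive a similarity equal to the decay constant, while "a" and "c" share nothing and stay at zero.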
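
The bookmark-weighted ranking described in thrust 2 can be sketched in a similar spirit. The code below is a hedged illustration, not the project's implementation: it biases a standard power-iteration page ranking by letting the random "jump" land only on bookmarked pages. The function name, the damping value, the uniform weighting over bookmarks, and the handling of pages without out-links are assumptions for this example.

```python
def personalized_pagerank(out_links, bookmarks, damping=0.85, iterations=50):
    """Power-iteration page ranking biased toward bookmarked pages.

    out_links maps every page to the list of pages it links to.
    bookmarks is a non-empty set of pages, all present in out_links.
    damping and iterations are illustrative assumptions.
    """
    nodes = list(out_links)
    # Jump distribution: uniform over the bookmarked pages only (assumption).
    jump = {n: (1.0 / len(bookmarks) if n in bookmarks else 0.0)
            for n in nodes}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) * jump[n] for n in nodes}
        for n in nodes:
            targets = out_links[n]
            if targets:
                # Split this page's rank evenly among the pages it links to.
                share = damping * rank[n] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Page with no out-links: hand its rank to the jump targets.
                for m in nodes:
                    new_rank[m] += damping * rank[n] * jump[m]
        rank = new_rank
    return rank


# Hypothetical toy Web: "b" and "c" are structurally identical,
# but the user has bookmarked only "c".
toy_web = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}
toy_rank = personalized_pagerank(toy_web, bookmarks={"c"})
```

Because "b" and "c" have the same link structure, an unbiased ranking would score them equally; with the bookmark bias, "c" ends up ranked above "b".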
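
The ground-truth measure in thrust 4, which estimates document similarity from the lowest common ancestor in a directory hierarchy, might look roughly like the sketch below. The slash-separated category paths, the function names, and the normalization by the deeper path's length are assumptions made for illustration; the study's actual scoring is not specified here.

```python
def lca_depth(path_a, path_b):
    """Number of leading category components shared by two paths."""
    depth = 0
    for x, y in zip(path_a.split("/"), path_b.split("/")):
        if x != y:
            break
        depth += 1
    return depth


def hierarchy_similarity(path_a, path_b):
    """Similarity in [0, 1]: depth of the lowest common ancestor divided
    by the depth of the deeper path (normalization is an assumption)."""
    deepest = max(len(path_a.split("/")), len(path_b.split("/")))
    return lca_depth(path_a, path_b) / deepest


# Hypothetical category paths in the style of a Web directory.
algebra = "Top/Science/Math/Algebra"
geometry = "Top/Science/Math/Geometry"
music = "Top/Arts/Music"

close_pair = hierarchy_similarity(algebra, geometry)  # shares Top/Science/Math
far_pair = hierarchy_similarity(algebra, music)       # shares only Top
```

A candidate text-based similarity measure could then be judged by how well its scores agree with these hierarchy-based scores over many document pairs.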

Project References

Since the project is in its very early stages, currently the best references are the papers listed under Publications and Products above.

Area Background

The World-Wide Web has created a resource comprising much of the world's knowledge and is incorporating a progressively larger fraction each year. Ideally, what is known by one should be known by (or at least available to) all. Yet today our ability to use the Web as an information resource -- whether to advance science, to enhance our lives, or just to get the best deal on a videotape -- is in a primitive state. Information on the Web is often hard to find and may be of dubious quality. Although information is presented in a universal HTML format, there are many fundamental differences across sites: words have different meanings, information is structured differently or not at all, and query interfaces vary widely. There are few ways to protect intellectual property on the Web, and the privacy of users is frequently compromised.

In spite of these shortcomings, at this point it is not practical to replace the Web as the underlying framework for sharing information and interacting with people. Thus, our ultimate goal is not to replace the Web with a new information resource, but rather to add functionality and tools to the Web -- to transform it into the Global InfoBase. Achieving this goal is by no means an easy task, but we believe that we can make fundamental advances in how information is created, searched, integrated, and managed on the Web, bringing us significantly closer to the GIB vision.

Area References

Please see Publications and Products above.