FROM THE WEB TO THE GLOBAL INFOBASE
Hector Garcia-Molina (PI), Chris Manning (co-PI), Jeff
Ullman (co-PI), Jennifer Widom (co-PI)
Department of Computer Science
Stanford University
Contact Information
Hector Garcia-Molina, Chris Manning, Jeff Ullman, Jennifer
Widom
Dept. of Computer Science
Gates Hall 4A
Stanford University
Stanford, CA 94305
Phone: (650) 723-0872
Fax: (650) 725-2588
Email: {hector,manning,ullman,widom}@cs.stanford.edu
URLs:
http://www-db.stanford.edu/people/hector.html
http://www-nlp.stanford.edu/~manning
http://www-db.stanford.edu/~ullman
http://www-db.stanford.edu/~widom
WWW Page
None yet
Project Award Information
- Award Number: IIS-0085896
- Duration: 9/1/00-8/31/03
- Title: From the Web to the Global InfoBase
Keywords
World-Wide Web, information retrieval, database, natural-language
processing, data mining
Project Summary
Our proposed work is driven by the vision of a Global InfoBase (GIB): a
ubiquitous and universal information resource, simple to use, up to date, and
comprehensive. Towards this vision, we will develop technologies needed to
transform today's World-Wide Web into the GIB. The project consists of four
interrelated thrusts:
- Combining Technologies: We
will integrate existing technologies for information retrieval, database
management, and hypertext navigation, to achieve a "universal"
information model and query language. We will begin our research by
transferring operators from one type of system (e.g., proximity search
from information retrieval) into other systems (e.g., relational databases).
Ultimately, we will develop operators and information representations
appropriate for all information, whether it is structured, unstructured,
hyperlinked, or multimedia.
- Personalization: We will
develop tools for personalizing information management, so that users
obtain more relevant and timely information. We will investigate new
mechanisms for extracting user preferences from access histories, home
pages, bookmarks, and other sources. We will also implement new search and
browsing algorithms that bias results in favor of users' interests.
- Semantics: Using
natural-language processing, we will develop sophisticated tools for
analyzing the semantics of Web pages and their interrelationships. We will
investigate how to apply such tools efficiently over large volumes of
information, and we will develop technology for tracking the lineage or
provenance of Web pages.
- Data Mining: We will design
new algorithms for mining information on the Web in order to synthesize
new knowledge. We will start by developing algorithms for learning facts
from the Web, based on an initial set of sample facts. We will also
develop new clustering algorithms that scale to the Web.
Publications and Products
The Global InfoBase project is only a few months old. The following list
represents recent publications by the PI and co-PIs related to the GIB.
- A. Arasu, J. Cho, H.
Garcia-Molina, A. Paepcke, and S. Raghavan. Searching the Web. ACM
Transactions on Internet Technology, to appear.
- M. Chavira, D.-K. Wong, and
C. Manning. Applying Hierarchical Classification Techniques without a
Hierarchy. Technical Report, Stanford University, January 2001.
- T. Haveliwala, A. Gionis, D.
Klein, and P. Indyk. Similarity Search on the Web: Evaluation and
Scalability Considerations. Technical Report, Stanford University,
February 2001.
- R. Goldman and J. Widom.
WSQ/DSQ: A Practical Approach for Combined Querying of Databases and the
Web. Proceedings of the ACM SIGMOD International Conference on Management
of Data, pages 285-296, Dallas, Texas, May 2000.
- R. Goldman, N. Shivakumar, S.
Venkatasubramanian, and H. Garcia-Molina. Proximity Search in Databases.
Proceedings of the Twenty-Fourth International Conference on Very Large
Data Bases, New York, August 1998.
Project Impact
The project is funding several graduate students with an expectation that
their work in the project will lead to Ph.D. theses. We expect to involve
several undergraduate students during the lifetime of the project, for course
units or funding. We will be supporting 1-2 undergraduates to perform
research within the project during summer 2001 through Stanford's new CURIS
(Computer Science Undergraduate Research Internship) program. We are currently
collaborating with the IBM Almaden Research Center, and we expect other
industrial collaborations to develop as the project matures.
Goals, Objectives, and Targeted Activities
Current and near-term future work is described here in terms of the four
thrusts specified in the Project Summary above. In addition, a significant
ongoing effort will be continuous integration of the work in the four thrusts
into a cohesive whole -- all four technologies must work together to create an
effective Global InfoBase.
- Combining Technologies: Our
current focus is on the problem of computing and exploiting
"similarity" measures in the context of the GIB. Similarity
plays an important role in a universal information resource. For example,
users may have identified a set of Web pages or other data items of
interest, and they may wish to find other pages or items that are
"most similar." Or users may wish to cluster sets of pages or
items based on their similarity, for browsing purposes. Similarity has
been studied before, for example in the context of text documents, but we
are approaching the problem in two new ways, both of which exploit
structure to obtain more accurate similarity results. In one area, we are
investigating ways to compute similarity of
sets of items when there is a superimposed hierarchy on the items, e.g.,
groceries purchased or college classes taken. The hierarchy allows us to
compute much more refined similarity measures than traditional methods
based on set intersection. In another area, we are investigating using
connectivity of Web pages or data items to determine similarity. The basic
idea is that two items are similar if they are connected to other items
that are similar. Although the definition is recursive, it can
be evaluated using eigenvectors, again producing what we believe to be
more refined similarity results. In both areas of work we are developing
and implementing a variety of algorithms, and we plan to perform
scalability and sensitivity experiments as well as user studies to show
their effectiveness when compared with simpler methods.
- Personalization: We are
currently exploring how to adapt search and ranking mechanisms to take
into account user preferences. For example, the "PageRank"
algorithm used by the Google search engine (www.google.com) can be
modified to give bookmarked pages additional weight in the recursive
PageRank evaluation. We expect that our work in similarity as described above
will provide additional tools for personalization. For example, based on
similarity measures we can provide users with pages or items of potential
interest automatically, and we can perform sophisticated collaborative
filtering.
- Semantics: Currently we are
attacking the problem of how to get from syntactic analyses of sentences
to semantic representations that can be used for information integration
over Web pages. A particular focus is the semantic identity
problem: determining when references to "C. D. Manning",
"Christopher Manning", or just "Chris" refer to
the same person, and when they don't. Natural-language processing
techniques are especially promising for this problem, because its solution
requires context-specific probabilistic decision making that
also draws on appropriate prior knowledge. The result is the
ability to create database records on the fly, for instance records holding
information such as a person's job title, contact details, and responsibilities. This
work extends information extraction or wrapper generation work in that the
relation(s) to be captured are not fixed in advance.
- Data Mining: Our current
focus is on automating the subjective evaluation of strategies for
clustering Web documents by topic. The space of alternatives tried
includes mixtures of words contained in the documents, words in the
"anchor text" leading to the document, and words surrounding the
anchor. To turn this subjective judgment into something that can be measured
without human intervention, different measures of document similarity are
evaluated on the assumption that the Open Directory (dmoz.org) represents
ground truth. There, document authors have subjectively classified their
documents in a way that lets us estimate the similarity of documents by
how close their lowest common ancestor in the hierarchy is to the documents
themselves. The study concludes that the best of the approaches tried are
those that look at text surrounding the anchor, weighting words by their
closeness to the anchor.
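The recursive definition of connectivity-based similarity given under Combining Technologies (two items are similar if they are connected to other items that are similar) can be illustrated with a small fixed-point iteration. This is a minimal sketch and not the project's algorithm: the text mentions an eigenvector-based evaluation, whereas the version below simply iterates the definition a fixed number of times. The function name `link_similarity`, the `decay` factor, the iteration count, and the toy graph are all assumptions made for illustration.

```python
from itertools import product


def link_similarity(in_links, decay=0.8, iterations=10):
    """Iterate a recursive, link-based similarity definition.

    in_links maps every node to the list of nodes that link to it
    (every node must appear as a key, possibly with an empty list).
    decay (assumed) damps similarity that is inherited through links.
    """
    nodes = sorted(in_links)
    # Start from the trivial assumption: a node is similar only to itself.
    sim = {(a, b): 1.0 if a == b else 0.0 for a, b in product(nodes, nodes)}
    for _ in range(iterations):
        new_sim = {}
        for a, b in product(nodes, nodes):
            if a == b:
                new_sim[(a, b)] = 1.0
                continue
            ia, ib = in_links[a], in_links[b]
            if not ia or not ib:
                # No incoming links means there is no evidence of similarity.
                new_sim[(a, b)] = 0.0
                continue
            # Average similarity over all pairs of in-neighbors.
            total = sum(sim[(x, y)] for x in ia for y in ib)
            new_sim[(a, b)] = decay * total / (len(ia) * len(ib))
        sim = new_sim
    return sim


# Hypothetical toy graph: a university page links to two professors,
# and each professor links to one student.
toy_in_links = {
    "univ": [],
    "profA": ["univ"],
    "profB": ["univ"],
    "studentA": ["profA"],
    "studentB": ["profB"],
}
scores = link_similarity(toy_in_links)
print(round(scores[("profA", "profB")], 3))
print(round(scores[("studentA", "studentB")], 3))
```

In this toy graph the two student pages never link to each other and share no in-neighbor, yet they receive a nonzero score because the pages pointing to them are themselves similar. That is the effect the recursive definition is meant to capture.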
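The Personalization item above mentions giving bookmark pages additional weight in the recursive page rank evaluation. One possible way to do this, shown here only as a hedged sketch, is to send the random-jump probability to the user's bookmarks instead of spreading it uniformly over all pages. The function name `personalized_pagerank`, the `damping` value, the handling of pages without out-links, and the toy graph are assumptions for illustration; the source does not specify how the modification is implemented.

```python
def personalized_pagerank(out_links, preferred, damping=0.85, iterations=50):
    """Power iteration for a link-based rank biased toward preferred pages.

    out_links maps every page to the list of pages it links to.
    preferred is a non-empty set of pages (e.g. bookmarks) that
    receive all of the random-jump probability (an assumed design).
    """
    pages = sorted(out_links)
    jump = {p: (1.0 / len(preferred) if p in preferred else 0.0) for p in pages}
    rank = dict(jump)
    for _ in range(iterations):
        # Every page starts with its share of the random-jump mass.
        new_rank = {p: (1.0 - damping) * jump[p] for p in pages}
        for p in pages:
            targets = out_links[p]
            if targets:
                share = damping * rank[p] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Page without out-links: hand its rank back to the jump targets.
                for q in pages:
                    new_rank[q] += damping * rank[p] * jump[q]
        rank = new_rank
    return rank


# Hypothetical four-page Web graph.
toy_out_links = {"a": ["b", "c"], "b": ["a"], "c": ["a"], "d": ["a"]}

uniform = personalized_pagerank(toy_out_links, preferred=set(toy_out_links))
biased = personalized_pagerank(toy_out_links, preferred={"c"})
print("rank of c, uniform jumps:   ", round(uniform["c"], 3))
print("rank of c, bookmark on c:   ", round(biased["c"], 3))
```

Passing all pages as `preferred` reproduces the unbiased ranking, so the same routine can be used to compare a personalized result against a baseline. A blend of uniform and bookmark-directed jumps would be an equally plausible design.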
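The evaluation idea in the Data Mining item, treating a topic hierarchy as ground truth and scoring candidate similarity measures against it, can be sketched as follows. This is an illustration under stated assumptions, not the procedure used in the study: category paths are represented as tuples, hierarchy similarity is taken to be the depth of the lowest common ancestor divided by the longer path length, and a candidate measure is scored by how often it orders document pairs the same way the hierarchy does. The names `hierarchy_similarity` and `agreement`, the scoring rule, and the example categories are all hypothetical.

```python
from itertools import permutations


def common_prefix_depth(path_a, path_b):
    """Depth of the lowest common ancestor of two category paths."""
    depth = 0
    for x, y in zip(path_a, path_b):
        if x != y:
            break
        depth += 1
    return depth


def hierarchy_similarity(path_a, path_b):
    """Ground-truth similarity in [0, 1]: a deeper shared ancestor scores higher."""
    longest = max(len(path_a), len(path_b))
    if longest == 0:
        return 1.0
    return common_prefix_depth(path_a, path_b) / longest


def agreement(docs, candidate):
    """Fraction of triples on which a candidate measure matches the hierarchy.

    docs maps a document id to its category path.
    candidate(doc_a, doc_b) returns a similarity score for two document ids.
    Only triples (a, b, c) where the hierarchy says b is strictly closer
    to a than c is are counted.
    """
    agree = total = 0
    for a, b, c in permutations(docs, 3):
        truth_ab = hierarchy_similarity(docs[a], docs[b])
        truth_ac = hierarchy_similarity(docs[a], docs[c])
        if truth_ab <= truth_ac:
            continue
        total += 1
        if candidate(a, b) > candidate(a, c):
            agree += 1
    return agree / total if total else 0.0


# Hypothetical documents with made-up category paths and anchor-window words.
toy_docs = {
    "d1": ("Top", "Science", "Physics"),
    "d2": ("Top", "Science", "Physics"),
    "d3": ("Top", "Arts", "Music"),
}
toy_words = {
    "d1": {"quantum", "particle", "energy"},
    "d2": {"particle", "energy", "collider"},
    "d3": {"guitar", "concert", "energy"},
}


def word_overlap(a, b):
    """Toy candidate measure: Jaccard overlap of the word sets."""
    return len(toy_words[a] & toy_words[b]) / len(toy_words[a] | toy_words[b])


print(agreement(toy_docs, word_overlap))
```

Because the score depends only on relative orderings, candidate measures on very different scales (word counts, weighted anchor windows, and so on) can be compared without calibration.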
Project References
Since the project is in its very early stages, currently the best references
are the papers listed under Publications and Products above.
Area Background
The World-Wide Web has created a resource comprising much of the world's
knowledge and is incorporating a progressively larger fraction each year.
Ideally, what is known by one should be known by (or at least available to)
all. Yet today our ability to use the Web as an information resource -- whether
to advance science, to enhance our lives, or just to get the best deal on a
videotape -- is in a primitive state. Information on the Web is often hard to
find and may be of dubious quality. Although information is presented in a
universal HTML format, there are many fundamental differences across sites:
words have different meanings, information is structured differently or not at
all, and query interfaces vary widely. There are few ways to protect
intellectual property on the Web, and the privacy of users is frequently
compromised.
In spite of these shortcomings, at this point it is not practical to replace
the Web as the underlying framework for sharing information and interacting with
people. Thus, our ultimate goal is not to replace the Web with a new
information resource, but rather to add functionality and tools to the Web --
to transform it into the Global InfoBase. Achieving this goal is by no means an
easy task, but we believe that we can make fundamental advances in how
information is created, searched, integrated, and managed on the Web, bringing
us significantly closer to the GIB vision.
Area References
Please see Publications and Products above.