Hector Garcia-Molina (PI), Chris Manning
(co-PI), Jeff Ullman (co-PI), Jennifer Widom (co-PI)
Department of Computer Science, Stanford University
Dept. of Computer Science, Gates 4A
Stanford University
Stanford, CA 94305-9040
Phone: (650) 723-0872
Fax: (650) 725-2588
Email: {hector,manning,ullman,widom}@cs.stanford.edu
URLs:
http://www-db.stanford.edu/people/hector.html
http://www-nlp.stanford.edu/~manning
http://www-db.stanford.edu/~ullman
http://www-db.stanford.edu/~widom
Sites relevant to the grant include: DB Group home page, Infolab home page, NLP Group home page, Digital Libraries project home page.
Project Award Information
Keywords
World-Wide Web, information retrieval,
database, natural-language processing, data mining
Project Summary
Our proposed work is driven by the vision of
a Global InfoBase (GIB): a ubiquitous and universal information resource, simple
to use, up to date, and comprehensive. The project consists of four
interrelated thrusts: (i) Combining Technologies: integrating
technologies for information retrieval, database management, and hypertext
navigation, to achieve a "universal" information model; (ii) Personalization:
developing tools for personalizing information management; (iii) Semantics:
using natural-language processing and structural techniques for analyzing the
semantics of Web pages; and (iv) Data Mining: designing new algorithms for
mining information in order to synthesize new knowledge.
Publications and Products
Project Impact
Goals, Objectives, and Targeted Activities
· Combining Technologies: We cannot achieve the challenging goals of our Global InfoBase without (a) extending existing technologies to work in vast information spaces and (b) developing new technologies where existing ones fail to scale. We investigated three directions. Multi-Model Queries: [1] examines how relational and full-text queries can be combined to yield results over multiple sources with radically different data models (e.g., "Find all Web pages that contain the phrase 'National Science Foundation' and are linked to by at least 10 other Web pages"); a sketch of such a query follows this paragraph. For our testbed we use WebBase, a 120-million-page repository developed as part of our Digital Library sister project. Search Over New Sources: We have been integrating new classes of information into the Global InfoBase: music and Internet chat rooms. Sound analysis is a notoriously difficult problem, especially in the realm of music, where acoustically very different signals can represent semantically identical material. In [2] we document our new system for retrieving similar music pieces from an audio database without metadata or other symbolic information. The real-time, conversational nature of Internet relay chat (IRC) poses a number of interesting problems for indexing archives for effective search; we present our preliminary results in [3]. Finally, in our technology-integration thrust, we developed new algorithms for scalable caches that serve vast numbers of cooperating sources. In [4], we present a best-effort synchronization scheduling policy that exploits cooperation between the data sources and the cache.
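To make the multi-model query idea concrete, here is a minimal Python sketch that combines a full-text predicate with a link-count predicate. The in-memory pages and links structures are hypothetical stand-ins for a repository such as WebBase, not its actual interface.

from collections import Counter

# Hypothetical stand-ins for a page repository and its link table.
pages = {
    "http://a.example": "... the National Science Foundation announced ...",
    "http://b.example": "... unrelated content ...",
}
links = [  # (source_url, target_url) pairs
    ("http://x.example", "http://a.example"),
]

def multi_model_query(phrase, min_inlinks):
    """Pages containing `phrase` that have at least `min_inlinks` in-links."""
    inlinks = Counter(target for _, target in links)
    return [url for url, text in pages.items()
            if phrase in text and inlinks[url] >= min_inlinks]

print(multi_model_query("National Science Foundation", min_inlinks=1))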
· Personalization: If our Global InfoBase harbors a dark side, it is an escalation of the well-documented 'information overload' problem. We have therefore focused on personalizing the interaction with information sources. Context-Sensitive Search: Many Web search engines compute absolute rankings for Web pages. While highly effective, such rankings do not take the user's context into account. In [9], we developed and implemented a topic-sensitive, link-based ranking measure for Web search, which exploits search context, including query context (e.g., query history) and user context (e.g., bookmarks and browsing history), to enhance the precision of search engines; a sketch of the core idea follows this paragraph. Structure-Based Similarity: In [13] and [14] we document our efforts to deduce Web page similarity through analysis of Web structure; for example, pages with common parentage might be related. Personalized-Precision Retrieval: Building on our scalable caching work [4], we were able to personalize querying in a novel way: users specify standing queries augmented by a specification of how up-to-date the results need to be. This quality-of-service measure is important to our Global InfoBase, because standing queries are only feasible if the system has some 'wiggle room' to optimize its operation [5,6].
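The following minimal sketch conveys the core idea behind topic-sensitive, link-based ranking as in [9]: instead of a uniform teleport vector, the PageRank random surfer teleports only to pages representing the user's context, biasing the ranking toward that context. The toy graph and bias set are illustrative assumptions, not data from the paper.

def topic_sensitive_pagerank(graph, bias_pages, damping=0.85, iters=50):
    """PageRank whose teleport step jumps only to context (bias) pages."""
    nodes = list(graph)
    bias = {n: (1.0 / len(bias_pages) if n in bias_pages else 0.0)
            for n in nodes}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {}
        for n in nodes:
            # Rank flows in from pages linking to n, plus biased teleport.
            inflow = sum(rank[m] / len(graph[m]) for m in nodes if n in graph[m])
            new[n] = (1 - damping) * bias[n] + damping * inflow
        rank = new
    return rank

# Toy Web graph: each page maps to the pages it links to.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
print(topic_sensitive_pagerank(graph, bias_pages={"a"}))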
· Semantics: We are attacking the problem of getting more meaning out of Web pages than simply the lists of words they contain. Much work has been done on clustering documents, but little on the problem of labeling clusters. For effective human navigation, the quality of the labeling is at least as important as the quality of the underlying clustering technique; [7] studies the effectiveness of current labeling techniques and devises algorithms for generating more effective labels. In [8], we address the problem of designing a crawler capable of extracting content from the hidden Web (pages behind forms). We introduce a new Layout-based Information Extraction Technique (LITE) and demonstrate its use in automatically extracting semantic information from search forms and response pages. In ongoing unpublished work, we continue to develop Web wrappers. Many Web sites contain large sets of pages generated from a database using a common template. We are developing an algorithm that reconstructs the template from sets of words with similar occurrence patterns across the input pages, and then extracts the database values from each page; a simplified sketch follows this paragraph. Experiments show that the extracted values make semantic sense in most cases.
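The sketch below conveys the intuition behind template-based extraction in a deliberately simplified form: tokens that occur in every input page are treated as template text, and the runs of tokens between them are extracted as data values. The actual algorithm's notion of similar occurrence patterns is considerably more refined than this intersection heuristic.

def extract_values(pages):
    """Split each page into data values using tokens common to all pages."""
    token_lists = [p.split() for p in pages]
    # Simplification: treat any token present in every page as template text.
    template = set.intersection(*(set(t) for t in token_lists))
    records = []
    for tokens in token_lists:
        values, current = [], []
        for tok in tokens:
            if tok in template:
                if current:  # close off the value accumulated so far
                    values.append(" ".join(current))
                    current = []
            else:
                current.append(tok)
        if current:
            values.append(" ".join(current))
        records.append(values)
    return records

pages = ["Title: War and Peace Price: 10", "Title: Hamlet Price: 7"]
print(extract_values(pages))  # [['War and Peace', '10'], ['Hamlet', '7']]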
· Data Mining: Work in this area applies data mining and machine learning techniques to automate Web-analysis tasks. In [10], we developed a methodology for evaluating various strategies for similarity search on the Web, using the Open Directory (a free Yahoo!-like hierarchy) as an external quality measure. The alternatives examined include link structure, words in the documents themselves, words in the "anchor text" of links leading to a document, and words surrounding the anchor. The best results come from using the text surrounding the anchor, with words weighted by their closeness to it; a sketch of this representation appears below. Clustering is a central problem in exploratory data mining, with strong Web applications, and here we have explored two more fundamental pieces of research. [11] presents an improved method for data clustering in the presence of sparse prior knowledge, given in the form of pairwise instance constraints; by allowing these constraints to have space-level effects, we exploit them more effectively than prior work does (a minimal constrained-clustering sketch also appears below). In [12], we observe that classical hierarchical agglomerative clustering methods, while widely used, have lacked a solid theoretical foundation, and we remedy this by providing probabilistic generative models for them.
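First, a minimal sketch of a proximity-weighted anchor-context representation in the spirit of the best-performing strategy in [10]. The window size and the 1/(1+distance) weighting are illustrative assumptions, not the tuned values from the study.

from collections import defaultdict

def anchor_context_vector(page_tokens, anchor_pos, window=10):
    """Weighted bag of words around the link at index anchor_pos."""
    vec = defaultdict(float)
    lo = max(0, anchor_pos - window)
    hi = min(len(page_tokens), anchor_pos + window + 1)
    for i in range(lo, hi):
        # Words closer to the anchor contribute more weight.
        vec[page_tokens[i]] += 1.0 / (1 + abs(i - anchor_pos))
    return vec

def cosine(u, v):
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    nu = sum(x * x for x in u.values()) ** 0.5
    nv = sum(x * x for x in v.values()) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

# Two pages are judged similar if the contexts of links to them are similar.
ctx1 = anchor_context_vector("grants from the national science foundation fund research".split(), 5)
ctx2 = anchor_context_vector("the foundation awards science grants each year".split(), 1)
print(cosine(ctx1, ctx2))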
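Second, a minimal sketch of clustering under pairwise instance constraints, in the style of instance-level constrained k-means: each point is assigned to the nearest centroid that does not violate a must-link or cannot-link constraint. The space-level constraint propagation that distinguishes [11] is not shown here.

import random

def dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def violates(i, c, assign, must_link, cannot_link):
    """Would assigning point i to cluster c break any constraint so far?"""
    for pairs, bad in ((must_link, lambda cj: cj != c),
                       (cannot_link, lambda cj: cj == c)):
        for a, b in pairs:
            j = b if a == i else a if b == i else None
            if j is not None and j in assign and bad(assign[j]):
                return True
    return False

def constrained_kmeans(points, k, must_link, cannot_link, iters=20):
    centroids = random.sample(points, k)
    assign = {}
    for _ in range(iters):
        assign = {}
        for i, p in enumerate(points):
            # Try centroids from nearest to farthest, skipping assignments
            # that conflict with constraints on already-assigned points.
            for c in sorted(range(k), key=lambda c: dist(p, centroids[c])):
                if not violates(i, c, assign, must_link, cannot_link):
                    assign[i] = c
                    break
        for c in range(k):  # recompute each centroid as its members' mean
            members = [points[i] for i in assign if assign[i] == c]
            if members:
                centroids[c] = tuple(sum(xs) / len(xs) for xs in zip(*members))
    return assign

random.seed(0)
pts = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
print(constrained_kmeans(pts, 2, must_link=[(0, 1)], cannot_link=[(1, 2)]))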
Project References
The main references are listed under Publications
and Products above.
Area Background
The World-Wide Web has become a resource comprising much of the world's knowledge, and it incorporates a progressively larger fraction each year. Yet today our ability to use the Web as an information
resource -- whether to advance science, to enhance our lives, or just to get
the best deal on a videotape -- is in a primitive state. Information on the Web
is often hard to find and may be of dubious quality. Although information is
presented in a universal HTML format, there are many fundamental differences
across sites: words have different meanings, information is structured
differently or not at all, and query interfaces vary widely. Our ultimate goal
is not to replace the Web with a new information resource, but rather to add
functionality and tools to the Web -- to transform it into the Global InfoBase.
Area References
Please see Publications and Products above.