Hector Garcia-Molina (PI), Chris Manning (co-PI), Jeff Ullman (co-PI), Jennifer Widom (co-PI)
Department of Computer Science, Stanford University
Hector Garcia-Molina,
Chris Manning, Jeff Ullman, Jennifer Widom
Dept. of Computer Science, Gates 4A
Stanford University
Stanford, CA 94305-9040
Phone: (650) 723-0872
Fax: (650) 725-2588
Email: {hector,manning,ullman,widom}@cs.stanford.edu
URLs:
http://www-db.stanford.edu/people/hector.html
http://www-nlp.stanford.edu/~manning
http://www-db.stanford.edu/~ullman
http://www-db.stanford.edu/~widom
Our project's main
Web page is at http://www-db.stanford.edu/gib/.
Other sites relevant to the grant
include: DB Group home page, Infolab home page, NLP Group home page, Digital Libraries project home page.
Project Award Information
Keywords
World-Wide Web,
information retrieval, database, natural-language processing, data mining
Project Summary
Our proposed work
is driven by the vision of a Global InfoBase (GIB): a ubiquitous and universal
information resource, simple to use, up to date, and comprehensive. The project
consists of four interrelated thrusts: (i) Combining Technologies:
integrating technologies for information retrieval, database management, and
hypertext navigation, to achieve a "universal" information model;
(ii) Personalization: developing tools for personalizing information
management; (iii) Semantics: using natural-language processing and
structural techniques for analyzing the semantics of Web pages; and (iv) Data
Mining: designing new algorithms for mining information in order to
synthesize new knowledge.
Publications and Products
Project Impact
Goals, Objectives, and Targeted Activities
· Combining Technologies: One of the
main goals of our project was to develop a “universal” data model that combined
facilities for simultaneously accessing text, relational data, and hypertext
links. We have developed such a model, as well as a prototype for executing and
optimizing queries in that model. In
particular, we have developed: (i) a formal data model for precisely defining
query semantics, (ii) a query algebra for formulating complex queries, and
(iii) techniques to optimize and efficiently execute these queries over massive
data sets. Our algebra extends traditional relational operators with graph
navigation operators, and adds the notion of ranked and ordered relations.
We have applied our model and optimizer to our WebBase repository (see
below), and have shown how Web pages can be searched based on their content,
their properties (e.g., PageRank), or their link structure. We have measured the running time of
representative complex Web queries, and have shown that our optimizer can
reduce running time by a factor of 2 or 3 in many cases.
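To make the flavor of such queries concrete, the following is a small, purely hypothetical Python sketch (it is not our actual algebra, optimizer, or WebBase interface): it combines a content predicate, a page property such as PageRank, and one step of link navigation over an invented toy page collection, returning an ordered (ranked) relation.

    # Hypothetical sketch only: a toy "ranked relation" over Web pages,
    # combining a text predicate, a PageRank property, and link navigation.
    # The page contents, scores, and links are invented for illustration.
    from dataclasses import dataclass

    @dataclass
    class Page:
        url: str
        text: str
        pagerank: float
        links: list          # outgoing URLs

    pages = {
        "a": Page("a", "database query optimization", 0.40, ["b", "c"]),
        "b": Page("b", "web crawling and indexing", 0.35, ["c"]),
        "c": Page("c", "ranked query algebra for the web", 0.25, ["a"]),
    }

    def select(rel, pred):
        # Relational selection over an iterable of pages.
        return [p for p in rel if pred(p)]

    def navigate(rel, hops=1):
        # Graph navigation: follow outgoing links from each page.
        out = list(rel)
        for _ in range(hops):
            out = [pages[u] for p in out for u in p.links]
        return out

    def rank(rel, key):
        # Produce an ordered (ranked) relation.
        return sorted(rel, key=key, reverse=True)

    # "Find pages mentioning 'query' with PageRank above 0.2, follow one
    # link hop, and return the result ranked by PageRank."
    hits = select(pages.values(), lambda p: "query" in p.text and p.pagerank > 0.2)
    result = rank(navigate(hits, hops=1), key=lambda p: p.pagerank)
    print([(p.url, p.pagerank) for p in result])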
· Personalization:
Our Global Information Base makes accessible so much information that we must
be very careful not to exacerbate the well-documented 'information overload'
problem. We address information
overload by enabling personalized views of information sources, particularly Web
sources. Most Web search engines
compute absolute rankings for Web pages, independent of the user or search
context. In [13] we developed and
implemented a topic-sensitive link-based ranking measure for web search that
exploits search context, including query context (e.g., query history) and user
context (e.g., bookmarks and browsing history) to enhance the precision of
search engines. We then took a further
step in personalization, enabling a fully customizable link-based ranking
measure for each user. This work is described in [18].
“Personalized PageRank” enables users to specify any set of web pages as
'more important' when computing link-based importance rankings. Once the problem was formalized, the
challenge was primarily one of efficiency and scalability: in practice,
Personalized PageRank may need to compute 2^N different rankings, where
N is the number of pages on the Web!
Our work focused on efficient algorithms that combine offline and online
computation, and share state and computation as much as possible while
computing fully personalized and customized rankings.
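As a minimal sketch of the basic formulation (not the offline/online decomposition actually studied in [18]), the following Python fragment computes a Personalized PageRank vector by power iteration, with the teleport vector concentrated on a hypothetical user-preferred page set; the toy link structure is invented for illustration.

    import numpy as np

    def personalized_pagerank(A, preferred, d=0.85, tol=1e-8, max_iter=100):
        # A is a column-stochastic link matrix: A[j, i] is the probability
        # of following a link from page i to page j.  `preferred` is the set
        # of page indices the user marks as 'more important'; the random
        # surfer teleports only to those pages.
        n = A.shape[0]
        v = np.zeros(n)
        v[list(preferred)] = 1.0 / len(preferred)   # personalized teleport vector
        x = np.full(n, 1.0 / n)                     # uniform starting vector
        for _ in range(max_iter):
            x_new = d * (A @ x) + (1 - d) * v
            if np.abs(x_new - x).sum() < tol:       # L1 convergence test
                return x_new
            x = x_new
        return x

    # Toy example: 4 pages, user prefers page 0.
    links = {0: [1, 2], 1: [2], 2: [0], 3: [2]}
    n = 4
    A = np.zeros((n, n))
    for i, outs in links.items():
        for j in outs:
            A[j, i] = 1.0 / len(outs)
    print(np.round(personalized_pagerank(A, preferred={0}), 4))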
· Semantics:
We continue to attack problems involved in getting more meaning out of web
pages and other text than simply lists of the words that they contain. In [12], we address the problem of
designing a crawler capable of extracting content from the hidden Web (pages
behind forms). We introduce a new Layout-based Information Extraction Technique
(LITE) and demonstrate its use in automatically extracting semantic information
from search forms and response pages.
In [2], we present results on learning web wrappers for sites with
regular structure. Many web sites contain large sets of pages generated from a
database using a common template. We have developed an algorithm that extracts
database values from such pages, using sets of words with similar occurrence
patterns across the input pages to reconstruct the template. Experiments
show that the extracted values make semantic sense in most cases. Other web pages contain unstructured text
paragraphs, and [25] examines methods for recognizing person, company, and
other names for indexing and extraction, particularly exploiting discriminative
machine learning techniques and character-based models, which are very robust
(for instance, largely immune to the kind of obfuscation techniques used by
spammers). [26] examines the use of
spectral methods (related to those used in link analysis methods like PageRank)
for the text-based tasks of document clustering and classification, showing how
the eigenspace can efficiently bring out the groupings of documents in the
space. [27, 28, 29, 30] provide methods
for natural language analysis, finding parts of speech and parsing sentences,
with a particular focus on methods that are both efficient to implement and
highly accurate. This kind of analysis
is a basis for more detailed semantic understanding of unstructured text
passages.
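As a generic illustration of the spectral idea mentioned above (the toy data and the two-dimensional projection are chosen only for this sketch; the exact method of [26] is not reproduced), the following Python fragment projects a small term-document matrix onto its leading singular vectors, where the grouping of the documents becomes visible.

    import numpy as np
    from numpy.linalg import svd

    # Toy term-document matrix: rows are terms, columns are documents.
    # The documents are invented for illustration only.
    docs = ["web crawler index pages", "crawler pages link graph",
            "gene protein sequence", "protein sequence alignment"]
    vocab = sorted({w for d in docs for w in d.split()})
    X = np.array([[d.split().count(w) for d in docs] for w in vocab], dtype=float)

    # Normalize document vectors so cosine similarity is the inner product;
    # the right singular vectors of X are then eigenvectors of the
    # document-similarity matrix X^T X.
    X /= np.linalg.norm(X, axis=0, keepdims=True)

    # Project each document onto the two leading singular directions.
    U, S, Vt = svd(X, full_matrices=False)
    embedding = Vt[:2].T          # one 2-d point per document
    print(np.round(embedding, 3))

    # Documents about the same topic land close together in this eigenspace;
    # a standard clustering step (e.g., k-means) would recover the groups.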
· Data Mining: We have
continued our work on mining the Web. A
central component of this work is our WebBase repository, a testbed for
supporting research that examines many aspects of the World Wide Web. These
aspects include experimentation towards understanding, optimally utilizing, and
improving the Web. The system's highly configurable crawlers collect large
numbers of Web pages and store them locally. We test novel algorithms on this
collection, for example for ranking, filtering, or Web linkage mapping. In addition,
we enable other researchers around the world to request high-speed `feeds' of
all or part of our collection. Such a feed allows these researchers to focus on
their investigations, rather than on constructing crawlers. In fact, even with
our present, small system, we have been able to help colleges around the
country, supporting them in both their research and teaching.
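For illustration only, here is a hypothetical Python sketch of how a researcher might consume such a feed: the actual WebBase feed protocol and data format are not shown, and the JSON-lines layout assumed below is simply a stand-in for this example.

    # Hypothetical sketch: consume a repository feed dumped as newline-
    # delimited JSON records of the form {"url": ..., "html": ...}, and
    # build a link graph for experiments without writing a crawler.
    import json
    import re

    def read_feed(path):
        # Yield (url, html) pairs from the assumed JSON-lines dump.
        with open(path) as f:
            for line in f:
                record = json.loads(line)
                yield record["url"], record["html"]

    def build_link_graph(feed):
        # Adjacency list over the crawled pages, ready for ranking,
        # filtering, or linkage-mapping experiments.
        href = re.compile(r'href="([^"]+)"', re.IGNORECASE)
        graph = {}
        for url, html in feed:
            graph[url] = href.findall(html)
        return graph

    # graph = build_link_graph(read_feed("webbase_feed.jsonl"))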
Using WebBase, we have been studying methods for speeding up the
computation of PageRank. In addition to
making a very time-consuming operation more efficient, our results make search
personalization (i.e., giving each individual a Web-page weighting function that
reflects their personal interests; see above) feasible. In particular, in [20], we look at the
sequence of 50 or so iterations of the matrix-vector multiplication that
converges to the PageRank (principal eigenvector of the matrix). We observe that, in a manner reminiscent of
Newton's method, we can predict where the terms of the vector are heading, and
approximately "jump ahead" in the calculation. The net effect is that about half as many
iterations are needed for a fixed level of convergence using this technique,
while the extrapolation itself adds almost nothing to the cost of a single
iteration. In [21] we take a completely
different approach. Observing that there are clearly defined and
easily discoverable regions of the Web that are dense in links (e.g., a
university's portion of the .edu domain, or a typical person's Web site), we
propose computing PageRank by first clustering the Web into some number of
these regions and then computing the local PageRank within each region. Next, we use the local PageRanks to estimate
how much importance one region will pass to another, and compute the PageRank
on the graph of regions only. Finally,
the ordinary PageRank calculation is initialized with each page's estimated
PageRank equal to the product of its local PageRank and the PageRank
of its region.
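The following toy Python sketch illustrates the block-structured initialization just described; the graph, the partition into regions, and all constants are invented for this example, and it is not the production algorithm of [21].

    import numpy as np

    def pagerank(A, d=0.85, x0=None, iters=60):
        # Standard power iteration on a column-stochastic matrix A.
        n = A.shape[0]
        x = np.full(n, 1.0 / n) if x0 is None else x0 / x0.sum()
        v = np.full(n, 1.0 / n)
        for _ in range(iters):
            x = d * (A @ x) + (1 - d) * v
        return x

    def stochastic(edges, nodes):
        # Column-stochastic link matrix restricted to the given node subset.
        idx = {u: k for k, u in enumerate(nodes)}
        A = np.zeros((len(nodes), len(nodes)))
        for u in nodes:
            outs = [w for w in edges.get(u, []) if w in idx]
            for w in outs:
                A[idx[w], idx[u]] = 1.0 / len(outs)
            if not outs:                      # dangling node: spread uniformly
                A[:, idx[u]] = 1.0 / len(nodes)
        return A

    edges = {0: [1], 1: [0, 2], 2: [0], 3: [4], 4: [3, 1]}   # toy Web graph
    blocks = [[0, 1, 2], [3, 4]]                             # assumed regions

    # 1. Local PageRank within each region, using intra-region links only.
    local = {}
    for b in blocks:
        local.update(dict(zip(b, pagerank(stochastic(edges, b)))))

    # 2. Region graph: the weight of edge I->J estimates the importance that
    #    pages in region I pass to region J, based on their local PageRank;
    #    then PageRank is computed on the (much smaller) graph of regions.
    where = {u: i for i, b in enumerate(blocks) for u in b}
    B = np.zeros((len(blocks), len(blocks)))
    for u, outs in edges.items():
        for w in outs:
            B[where[w], where[u]] += local[u] / len(outs)
    B /= B.sum(axis=0, keepdims=True)
    block_pr = pagerank(B)

    # 3. Initialize the full computation with local PageRank x region PageRank.
    nodes = sorted(edges)
    x0 = np.array([local[u] * block_pr[where[u]] for u in nodes])
    print(np.round(pagerank(stochastic(edges, nodes), x0=x0), 4))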
Project References.
The main references
are listed under Publications and Products above.
Area Background
The World-Wide Web
has created a resource comprising much of the world's knowledge and is
incorporating a progressively larger fraction each year. Yet today our ability
to use the Web as an information resource -- whether to advance science, to
enhance our lives, or just to get the best deal on a videotape -- is in a
primitive state. Information on the Web is often hard to find and may be of
dubious quality. Although information is presented in a universal HTML format,
there are many fundamental differences across sites: words have different
meanings, information is structured differently or not at all, and query
interfaces vary widely. Our ultimate goal is not to replace the Web with a new
information resource, but rather to add functionality and tools to the Web --
to transform it into the Global InfoBase.
Area References.
Please see Publications and Products above.