Hector
Garcia-Molina (PI), Chris Manning (co-PI), Jeff Ullman (co-PI), Jennifer Widom
(co-PI)
Department of Computer Science, Stanford University
Hector
Garcia-Molina, Chris Manning, Jeff Ullman, Jennifer Widom
Dept. of Computer Science, Gates 4A
Stanford University
Stanford, CA 94305-9040
Phone: (650) 723-0872
Fax: (650) 725-2588
Email: {hector,manning,ullman,widom}@cs.stanford.edu
URLs:
http://www-db.stanford.edu/people/hector.html
http://www-nlp.stanford.edu/~manning
http://www-db.stanford.edu/~ullman
http://www-db.stanford.edu/~widom
Our project's main Web page is at http://www-db.stanford.edu/gib/.
Other sites relevant to the grant
include: DB Group home page, Infolab home page, NLP Group home page, Digital Libraries project home page.
Project Award Information
Keywords
World-Wide Web,
information retrieval, database, natural-language processing, data mining
Broader Impact
Activities
There were four noteworthy activities this year (Jul 03-Jun 04):
(1) Three of the students working on the Global InfoBase (GIB) Project started
a company, Kaltix, which was then acquired by Google. The students, Sep Kamvar,
Taher Haveliwala, and Glen Jeh, founded Kaltix based on the search
personalization technology developed by the GIB Project. After the founders
gave talks at the major search engine companies (Google, Yahoo, Microsoft, ...)
describing the GIB technologies and their implementation, Google acquired
Kaltix in late 2003. The founders are now working at Google, and the Google
implementation of the core technology is presently showcased on the Google
Labs site.
(2) In May 2003, David Hart at NSF produced a press release highlighting
our work, initially reported last year, on efficient personalized PageRank
calculation (NSF PR 03-56), timed to coincide with
the World Wide Web 2003 conference. As well as correctly highlighting the
strong linkages between basic mathematical research and real-world computing
advances, this press release brought considerable media and industry attention
to the work. In many ways it set in motion the sequence of events that led the
Kaltix students to commercialize the technology and start the company.
(3) We organized a Stanford-IBM Day on November 7, 2004. This one-day workshop
highlighted work done at Stanford and at IBM, including GIB work. About 75
people from Stanford and IBM participated; details of the event are posted on
the project web site. This type of event plays an important technology transfer
role, as we at Stanford disseminate our results to industry while at the same
time learning what industry is working on and what problems they are facing.
(4) Similarly, we organized a Stanford-Google Day on April 8, 2004. About 25
Stanford faculty and students, including GIB participants, visited Google and
made presentations on our current work.
Researchers from Google discussed some of the problems they are focusing on.
Again, this event provided another opportunity for technology transfer.
Students
Sep Kamvar worked on the mathematical foundations of PageRank-style methods,
developing with Taher Haveliwala results on the eigengap and condition number
of the Web, and methods for effectively personalizing PageRank computations.
Kamvar completed his PhD and is now working at Google.
Chris Olston worked in the area of approximate replication. Olston completed
his PhD and is now an Assistant Professor in the Computer Science Department at
Carnegie Mellon University.
Sriram Raghavan worked on the integration of information technologies, as well
as on efficient execution of complex web queries. Raghavan is now working at
the IBM Almaden Research Center.
He is expected to complete his thesis by the end of the summer of 2004.
Research Activities
In the period July 2003 - June 2004 we continued work on the GIB project,
touching on the four project areas: combining technologies, personalization,
semantics, and data mining. In what follows we provide some highlights of our
research. For a complete list of publications and additional details, please
visit the project
web site.
· Adaptive Methods for the
Computation of PageRank. We have observed that the convergence patterns of pages in the
PageRank algorithm have a nonuniform distribution. Specifically, many pages
converge to their true PageRank quickly, while relatively few pages take a much
longer time to converge. Furthermore, we observe that these slow-converging
pages are generally those pages with high PageRank. We used this observation to
devise a simple algorithm to speed up the computation of PageRank, in which the
PageRank of pages that have converged is not recomputed at each iteration
after convergence. This algorithm, which we call Adaptive PageRank, speeds up
the computation of PageRank by nearly 30%. For additional details, see:
Sepandar D. Kamvar, Taher H. Haveliwala, and Gene H. Golub. Adaptive Methods for the
Computation of PageRank. International Conference on the Numerical Solution
of Markov Chains, September 2003.
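To make the adaptive idea concrete, the following sketch (our own simplified illustration rather than the implementation evaluated in the paper) runs ordinary PageRank iterations but freezes the score of any page whose value has stopped changing and skips recomputing it afterward. The graph representation, tolerance, and omission of dangling-page handling are assumptions made only for the example.

    # Simplified Adaptive-PageRank-style iteration (illustrative only).
    # in_links[p]  : list of pages linking to page p
    # out_degree[q]: number of out-links of page q
    def adaptive_pagerank(in_links, out_degree, c=0.85, tol=1e-4, max_iter=100):
        pages = list(in_links)
        n = len(pages)
        rank = {p: 1.0 / n for p in pages}
        converged = {p: False for p in pages}
        for _ in range(max_iter):
            new_rank = dict(rank)
            for p in pages:
                if converged[p]:
                    continue  # converged pages are not recomputed
                incoming = sum(rank[q] / out_degree[q] for q in in_links[p])
                new_rank[p] = (1.0 - c) / n + c * incoming
                if abs(new_rank[p] - rank[p]) < tol:
                    converged[p] = True
            rank = new_rank
            if all(converged.values()):
                break
        return rank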
· Computing PageRank using
Power Extrapolation. We have developed a novel technique for speeding up the computation of
PageRank, a hyperlink-based estimate of the "importance" of Web pages, based
on the ideas presented in "Extrapolation Methods for Accelerating PageRank
Computations". The original PageRank algorithm uses the Power Method to
compute successive iterates that converge to the principal eigenvector of the
Markov matrix representing the Web link graph. Our new algorithm, called Power
Extrapolation, accelerates the convergence of the Power Method by subtracting
off the error along several nonprincipal eigenvectors from the current iterate
of the Power Method, making use of known nonprincipal eigenvalues of the Web
hyperlink matrix. Empirically, we have shown that using Power Extrapolation
speeds up PageRank computation by 30% on a Web graph of 80 million nodes in
realistic scenarios over the standard power method, in a way that is simple to
understand and implement. For additional details, please see:
Taher Haveliwala, Sepandar Kamvar, Dan Klein, Christopher Manning, and Gene
Golub. Computing
PageRank using Power Extrapolation, Preprint, July 2003.
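The core step can be illustrated as follows; this is a minimal sketch of the simplest case (a single known nonprincipal eigenvalue equal to the teleport constant c), with names chosen by us for illustration. Subtracting c times the previous iterate cancels the error along the corresponding eigenvector, after which the vector is renormalized and power iteration resumes.

    import numpy as np

    def power_extrapolation_step(x_curr, x_prev, c=0.85):
        # Cancel the error component associated with the known
        # nonprincipal eigenvalue c, then renormalize to sum to 1.
        y = x_curr - c * x_prev
        return y / y.sum()

    # Usage sketch: run ordinary power iterations, and occasionally replace
    # the current iterate x_k with power_extrapolation_step(x_k, x_prev).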
· An Analytical Comparison of
Approaches to Personalizing PageRank. PageRank, the popular link-analysis algorithm for
ranking web pages, assigns a query- and user-independent estimate of
"importance" to web pages. Query- and user-sensitive extensions of
PageRank, which use a basis set of biased PageRank vectors, have been proposed
in order to personalize the ranking function in a tractable way. We have
analytically compared three recent approaches to personalizing PageRank and
have studied the tradeoffs of each one. For additional details, please
see:
Taher Haveliwala, Sepandar D. Kamvar, and Glen Jeh. An Analytical Comparison of
Approaches to Personalizing PageRank, Preprint, June 2003.
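The property these approaches share is that personalized rankings are assembled from a small basis set of biased PageRank vectors computed offline, rather than computed from scratch for each user. A minimal sketch of that composition step follows; the variable names and the three-topic example are ours, for illustration only.

    import numpy as np

    def personalized_vector(basis_vectors, weights):
        # Combine precomputed biased PageRank vectors (one per basis topic)
        # using a user's preference weights, which should sum to 1.
        return np.asarray(sum(w * v for w, v in zip(weights, basis_vectors)))

    # Example: a user who cares 70% about topic 0 and 30% about topic 2.
    # basis_vectors = [pr_topic0, pr_topic1, pr_topic2]   # computed offline
    # ranking = personalized_vector(basis_vectors, [0.7, 0.0, 0.3])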
· The Condition Number of the
PageRank Problem. We have determined analytically the condition number of the PageRank
problem. Our result has implications for the accuracy to which PageRank can be
computed, currently and as the web scales. Furthermore, it provides a simple
proof that, for the parameters currently used by Google in computing PageRank,
small changes in the link structure of the web do not cause large changes in
the PageRanks of pages of the web. For additional details, please see:
Sepandar D. Kamvar and Taher H. Haveliwala, The Condition Number of the
PageRank Problem, Preprint, June 2003.
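In compact form, and assuming the usual formulation of the PageRank matrix as A = cP + (1-c)E with teleport constant c, the result can be paraphrased as

    \kappa_{1}(A) \;=\; \frac{1 + c}{1 - c},

so that for the commonly used value c = 0.85 the problem is well conditioned (kappa_1 is roughly 12.3), which underlies the stability statement above.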
· The Second Eigenvalue of the Google
Matrix. We have determined analytically the modulus of the second
eigenvalue for the web hyperlink matrix used by Google for computing PageRank.
Our results have implications for the convergence rate of the standard PageRank
algorithm as the web scales, for the stability of PageRank to perturbations to
the link structure of the web, for the detection of Google spammers, and for
the design of algorithms to speed up PageRank. For additional details, please
see:
Taher H. Haveliwala and Sepandar D. Kamvar, The Second Eigenvalue of the
Google Matrix. Preprint, April 2003.
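Stated compactly (our paraphrase, in the same notation A = cP + (1-c)E used above), the second eigenvalue satisfies

    |\lambda_{2}(A)| \;\le\; c,

with equality when the Web link graph contains at least two irreducible closed subsets, so the asymptotic convergence rate of the standard power method is governed by the teleport constant c rather than by the fine structure of the Web.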
· Mining Properties of Graph-Structured
Information. We have developed a new data mining system for
graph-structured information, called F-Miner. Prior to F-Miner, most data
mining algorithms on graphs looked for nodes satisfying specific properties,
such as specific notions of structural similarity or specific measures of
link-based importance. While such analyses for predetermined properties can be
effective in well-understood domains, sometimes identifying an appropriate
property for analysis can be a challenge, and focusing on a single property may
neglect other important aspects of the data. Glen and Jennifer developed a
foundation for mining the properties themselves, including a theoretical
framework defining the space of graph properties, a variety of mining queries
enabled by the framework, and techniques to handle the enormous size of the
query space. The F-Miner prototype system demonstrates the utility and
feasibility of property mining. For additional details, please see:
G. Jeh and J. Widom. Mining
the Space of Graph Properties. To appear in Proceedings of the Tenth ACM
SIGKDD International Conference on Knowledge Discovery and Data Mining,
Seattle, Washington, August 2004.
· Keyword Search over Relational Databases.
We developed a new system called EKSO (Efficient Keyword Search through Offline
indexing) that provides a general-purpose keyword search interface over any relational
database. Typically, relational database systems require the user to learn SQL
and to know the schema of the underlying data even to pose simple searches.
EKSO supports highly efficient keyword-based search over relational databases:
The database is "crawled" in advance, text-indexing virtual documents
that correspond to interconnected database content. At query time, the text
index supports keyword-based searches with instantaneous response, identifying
database objects corresponding to the virtual documents matching the query.
EKSO uses the DB2 Net Search Extender for indexing and keyword-search
processing. Experimental results show that index size is manageable, query
response time is instantaneous, and database updates (which are propagated incrementally
as recomputed virtual documents to the text index) do not significantly hinder
query performance. Qi and Jennifer also performed a user study confirming the
superiority of keyword-based search over SQL for a wide range of database
retrieval tasks. For additional details, please see:
Q. Su and J. Widom. Indexing
Relational Database Content Offline for Efficient Keyword-Based Search.
Submitted for conference publication, May 2004.
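A minimal sketch of the offline half of this approach appears below: related rows are joined into per-object virtual text documents, which are then indexed for keyword search. EKSO itself uses the DB2 Net Search Extender; the toy schema, helper names, and in-memory inverted index here are assumptions made only for illustration.

    import sqlite3
    from collections import defaultdict

    def build_virtual_documents(conn):
        # "Crawl" the database: each order row is a root object, joined with
        # its customer and product rows to form one searchable text document.
        docs = {}
        rows = conn.execute("""
            SELECT o.id, c.name, p.name, o.notes
            FROM orders o
            JOIN customers c ON o.customer_id = c.id
            JOIN products  p ON o.product_id  = p.id
        """)
        for order_id, customer, product, notes in rows:
            docs[order_id] = " ".join([customer, product, notes or ""])
        return docs

    def build_inverted_index(docs):
        # Offline text indexing of the virtual documents.
        index = defaultdict(set)
        for doc_id, text in docs.items():
            for term in text.lower().split():
                index[term].add(doc_id)
        return index

    def keyword_search(index, query):
        # At query time, intersect posting lists; the user needs no SQL
        # and no knowledge of the underlying schema.
        postings = [index.get(t, set()) for t in query.lower().split()]
        return set.intersection(*postings) if postings else set()

    # Usage sketch:
    # conn = sqlite3.connect("shop.db")
    # index = build_inverted_index(build_virtual_documents(conn))
    # matches = keyword_search(index, "milk gonzalez")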
· Approximate Replication. Chris Olston completed his Ph.D. thesis under the guidance of Jennifer Widom in the area of approximate replication. The thesis addresses the common case of distributed information management environments where data is replicated across multiple nodes. Keeping replicas exactly consistent poses a significant challenge due to the communication cost, but fortunately in many applications some amount of inaccuracy or staleness may be tolerable, in which case approximate replication is used. Chris's thesis identifies a fundamental tradeoff between precision and performance in these environments, and develops two complementary approaches to working with this tradeoff: (1) Maximize precision of replicas in the presence of constraints on communication costs. (2) Minimize communication cost in the presence of constraints on replica precision. For each approach, the thesis develops a formal problem definition, analyses, algorithms, and implementation techniques. Experiments were conducted with both synthetic and real-world data. A testbed network traffic monitoring system was built using the approximate replication techniques; it can track usage patterns and flag potential security hazards.
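To make the precision-performance tradeoff concrete, here is a minimal sketch of the second approach (minimize communication subject to a precision constraint): the source refreshes the replica only when the true value drifts outside a tolerance interval around the replicated value, so the replica is always within a known error bound. The class, the scalar setting, and the monitoring example are simplifications of ours, not the thesis implementation.

    class ApproximateReplica:
        # Replica of a numeric value guaranteed to stay within +/- delta of
        # the source value, refreshing only when the bound would be violated.
        def __init__(self, initial_value, delta):
            self.value = initial_value      # value held at the replica
            self.delta = delta              # precision bound
            self.messages_sent = 0

        def source_update(self, true_value):
            # Called at the data source whenever the true value changes.
            if abs(true_value - self.value) > self.delta:
                self.value = true_value     # push a refresh to the replica
                self.messages_sent += 1

    # Usage sketch: a traffic-monitoring source could call source_update() on
    # every measurement; queries are answered from self.value with error at
    # most delta, at a small fraction of the cost of exact replication.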
· Exploiting Hierarchical Domain Structure
to Compute Similarity. As part of our efforts to integrate information
technologies, we studied the notion of similarity between objects. This notion
finds use in many contexts, e.g., in search engines, collaborative filtering,
and clustering. Objects being compared often are modeled as sets, with their
similarity traditionally determined based on set intersection.
Intersection-based measures do not accurately capture similarity in certain
domains, such as when the data is sparse or when there are known relationships
between items within sets. We have developed new measures that exploit a
hierarchical domain structure in order to produce more intuitive similarity
scores. We also extended our similarity measures to provide appropriate results
in the presence of multisets (also handled unsatisfactorily by traditional
measures), e.g., to correctly compute the similarity between customers who buy
several instances of the same product (say milk), or who buy several products
in the same category (say dairy products). We have also performed an
experimental comparison of our measures against traditional similarity
measures, including an informal user study that evaluated how well our measures
match human intuition. For additional details, please see:
Prasanna Ganesan, Hector Garcia-Molina, and Jennifer Widom. Exploiting
Hierarchical Domain Structure to Compute Similarity, ACM Transactions on
Information Systems (TOIS), Vol. 21, No. 1, 2003, pp. 64-93.
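As a simplified illustration of how a hierarchy can enter a similarity computation (this is a sketch of the general idea under assumptions of ours, not the exact measures defined in the paper): items are modeled as leaves of a category tree, two items are scored by the depth of their lowest common ancestor, and two sets are compared by best-match averaging of these item scores.

    def lca_depth(path_a, path_b):
        # Depth of the lowest common ancestor, given root-to-leaf category
        # paths, e.g. ("food", "dairy", "milk") and ("food", "dairy", "cheese").
        depth = 0
        for a, b in zip(path_a, path_b):
            if a != b:
                break
            depth += 1
        return depth

    def item_similarity(path_a, path_b):
        # 1.0 for identical leaves; higher scores when items share a deeper
        # part of the hierarchy (milk vs. cheese > milk vs. soap).
        return lca_depth(path_a, path_b) / max(len(path_a), len(path_b))

    def set_similarity(set_a, set_b):
        # Best-match average of item similarities between two sets of items.
        if not set_a or not set_b:
            return 0.0
        best_a = sum(max(item_similarity(a, b) for b in set_b) for a in set_a)
        best_b = sum(max(item_similarity(a, b) for a in set_a) for b in set_b)
        return 0.5 * (best_a / len(set_a) + best_b / len(set_b))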