Hector Garcia-Molina (PI), Chris Manning
(co-PI), Jeff Ullman (co-PI), Jennifer Widom (co-PI)
Department of Computer Science, Stanford University
Dept. of Computer Science, Gates 4A
Stanford University
Stanford, CA 94305-9040
Phone: (650) 723-0872
Fax: (650) 725-2588
Email: {hector,manning,ullman,widom}@cs.stanford.edu
URLs:
http://www-db.stanford.edu/people/hector.html
http://www-nlp.stanford.edu/~manning
http://www-db.stanford.edu/~ullman
http://www-db.stanford.edu/~widom
Sites relevant to the grant include: DB Group home page, Infolab home page, NLP Group home page, Digital Libraries project home page.
Project Award Information
Keywords
World-Wide Web, information retrieval,
database, natural-language processing, data mining
Project Summary
Our proposed work is driven by the vision of
a Global InfoBase (GIB): a ubiquitous and universal information resource, simple
to use, up to date, and comprehensive. The project consists of four
interrelated thrusts: (i) Combining Technologies: integrating
technologies for information retrieval, database management, and hypertext
navigation, to achieve a "universal" information model; (ii) Personalization:
developing tools for personalizing information management; (iii) Semantics:
using natural-language processing and structural techniques for analyzing the
semantics of Web pages; and (iv) Data Mining: designing new algorithms for
mining information in order to synthesize new knowledge.
Publications and Products
Project Impact
Goals, Objectives, and Targeted Activities
· Combining Technologies: We cannot achieve the challenging goals of our Global InfoBase without (a) extending existing technologies to work in vast information spaces and (b) developing new technologies where existing ones fail to scale. We investigated three directions. Multi-Model Queries: [1] examines how relational and full-text queries can be combined to yield results over multiple sources with radically different data models (e.g., "Find all Web pages that contain the phrase 'National Science Foundation' and are linked to by at least 10 other Web pages"); a sketch of such a query follows this paragraph. For our testbed we use WebBase, a 120-million-page repository developed as part of our Digital Library sister project. Search Over New Sources: We have been integrating new classes of information into the Global InfoBase: music and Internet chat rooms. Sound analysis is a notoriously difficult problem, especially in the realm of music, where acoustically very different signals can represent semantically identical material. In [2] we document our new system for retrieving similar music pieces from an audio database without metadata or other symbolic information. The real-time, conversational nature of Internet relay chat (IRC) poses a number of interesting problems for indexing archives for effective search; we present our preliminary results in [3]. Finally, in our technology-integration thrust, we developed new algorithms for scalable caches that serve vast numbers of cooperating sources. In [4], we present a best-effort synchronization scheduling policy that exploits cooperation between the data sources and the cache.
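To make the multi-model query idea concrete, here is a minimal Python sketch that combines a full-text predicate with a link-count predicate. The in-memory pages and links structures are hypothetical stand-ins for a repository such as WebBase, not its actual interface.

from collections import Counter

# Hypothetical stand-ins for a page repository and its link table.
pages = {
    "http://a.example": "... the National Science Foundation announced ...",
    "http://b.example": "... unrelated content ...",
}
links = [  # (source_url, target_url) pairs
    ("http://x.example", "http://a.example"),
]

def multi_model_query(phrase, min_inlinks):
    """Pages containing `phrase` that have at least `min_inlinks` in-links."""
    inlinks = Counter(target for _, target in links)
    return [url for url, text in pages.items()
            if phrase in text and inlinks[url] >= min_inlinks]

print(multi_model_query("National Science Foundation", min_inlinks=1))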
· Personalization: If our Global InfoBase harbors a dark side, it is an escalation of the well-documented 'information overload' problem. We have therefore focused on personalizing the interaction with information sources. Context-Sensitive Search: Many Web search engines compute absolute rankings for Web pages. While highly effective, such rankings do not take the user's context into account. In [9], we developed and implemented a topic-sensitive, link-based ranking measure for Web search, which exploits search context, including query context (e.g., query history) and user context (e.g., bookmarks and browsing history), to enhance the precision of search engines; a sketch of the core idea follows this paragraph. Structure-Based Similarity: In [13] and [14] we document our efforts to deduce Web page similarity through analysis of Web structure; for example, pages with common parentage might be related. Personalized-Precision Retrieval: Building on our scalable caching work [4], we were able to personalize querying in a novel way: users specify standing queries augmented by a specification of how up-to-date the results need to be. This quality-of-service measure is important to our Global InfoBase, because standing queries are only feasible if the system has some 'wiggle room' to optimize its operation [5,6].
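The following minimal sketch conveys the core idea behind topic-sensitive, link-based ranking as in [9]: instead of a uniform teleport vector, the PageRank random surfer teleports only to pages representing the user's context, biasing the ranking toward that context. The toy graph and bias set are illustrative assumptions, not data from the paper.

def topic_sensitive_pagerank(graph, bias_pages, damping=0.85, iters=50):
    """PageRank whose teleport step jumps only to context (bias) pages."""
    nodes = list(graph)
    bias = {n: (1.0 / len(bias_pages) if n in bias_pages else 0.0)
            for n in nodes}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {}
        for n in nodes:
            # Rank flows in from pages linking to n, plus biased teleport.
            inflow = sum(rank[m] / len(graph[m]) for m in nodes if n in graph[m])
            new[n] = (1 - damping) * bias[n] + damping * inflow
        rank = new
    return rank

# Toy Web graph: each page maps to the pages it links to.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
print(topic_sensitive_pagerank(graph, bias_pages={"a"}))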
· Semantics: We are attacking the problem of getting more meaning out of Web pages than simply the lists of words they contain. Much work has been done on clustering documents, but little on the problem of labeling clusters. For effective human navigation, the quality of the labeling is at least as important as the quality of the underlying clustering technique; [7] studies the effectiveness of current labeling techniques and devises algorithms for generating more effective labels. In [8], we address the problem of designing a crawler capable of extracting content from the hidden Web (pages behind forms). We introduce a new Layout-based Information Extraction Technique (LITE) and demonstrate its use in automatically extracting semantic information from search forms and response pages. In ongoing unpublished work, we continue to develop Web wrappers. Many Web sites contain large sets of pages generated from a database using a common template. We are developing an algorithm that reconstructs the template from sets of words with similar occurrence patterns across the input pages, and then extracts the database values from each page; a simplified sketch follows this paragraph. Experiments show that the extracted values make semantic sense in most cases.
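The sketch below conveys the intuition behind template-based extraction in a deliberately simplified form: tokens that occur in every input page are treated as template text, and the runs of tokens between them are extracted as data values. The actual algorithm's notion of similar occurrence patterns is considerably more refined than this intersection heuristic.

def extract_values(pages):
    """Split each page into data values using tokens common to all pages."""
    token_lists = [p.split() for p in pages]
    # Simplification: treat any token present in every page as template text.
    template = set.intersection(*(set(t) for t in token_lists))
    records = []
    for tokens in token_lists:
        values, current = [], []
        for tok in tokens:
            if tok in template:
                if current:  # close off the value accumulated so far
                    values.append(" ".join(current))
                    current = []
            else:
                current.append(tok)
        if current:
            values.append(" ".join(current))
        records.append(values)
    return records

pages = ["Title: War and Peace Price: 10", "Title: Hamlet Price: 7"]
print(extract_values(pages))  # [['War and Peace', '10'], ['Hamlet', '7']]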
· Data Mining: Work in this area applies data mining and machine learning techniques to automate Web-analysis tasks. In [10], we developed a methodology for evaluating various strategies for similarity search on the Web, using the Open Directory (a free Yahoo!-like hierarchy) as an external quality measure. The alternatives examined include link structure, words in the documents themselves, words in the "anchor text" of links leading to a document, and words surrounding the anchor. The best results come from using the text surrounding the anchor, with words weighted by their closeness to it; a sketch of this representation appears below. Clustering is a central problem in exploratory data mining, with strong Web applications, and here we have explored two more fundamental pieces of research. [11] presents an improved method for data clustering in the presence of sparse prior knowledge, given in the form of pairwise instance constraints; by allowing these constraints to have space-level effects, we exploit them more effectively than prior work does (a minimal constrained-clustering sketch also appears below). In [12], we observe that classical hierarchical agglomerative clustering methods, while widely used, have lacked a solid theoretical foundation, and we remedy this by providing probabilistic generative models for them.
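First, a minimal sketch of a proximity-weighted anchor-context representation in the spirit of the best-performing strategy in [10]. The window size and the 1/(1+distance) weighting are illustrative assumptions, not the tuned values from the study.

from collections import defaultdict

def anchor_context_vector(page_tokens, anchor_pos, window=10):
    """Weighted bag of words around the link at index anchor_pos."""
    vec = defaultdict(float)
    lo = max(0, anchor_pos - window)
    hi = min(len(page_tokens), anchor_pos + window + 1)
    for i in range(lo, hi):
        # Words closer to the anchor contribute more weight.
        vec[page_tokens[i]] += 1.0 / (1 + abs(i - anchor_pos))
    return vec

def cosine(u, v):
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    nu = sum(x * x for x in u.values()) ** 0.5
    nv = sum(x * x for x in v.values()) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

# Two pages are judged similar if the contexts of links to them are similar.
ctx1 = anchor_context_vector("grants from the national science foundation fund research".split(), 5)
ctx2 = anchor_context_vector("the foundation awards science grants each year".split(), 1)
print(cosine(ctx1, ctx2))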
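Second, a minimal sketch of clustering under pairwise instance constraints, in the style of instance-level constrained k-means: each point is assigned to the nearest centroid that does not violate a must-link or cannot-link constraint. The space-level constraint propagation that distinguishes [11] is not shown here.

import random

def dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def violates(i, c, assign, must_link, cannot_link):
    """Would assigning point i to cluster c break any constraint so far?"""
    for pairs, bad in ((must_link, lambda cj: cj != c),
                       (cannot_link, lambda cj: cj == c)):
        for a, b in pairs:
            j = b if a == i else a if b == i else None
            if j is not None and j in assign and bad(assign[j]):
                return True
    return False

def constrained_kmeans(points, k, must_link, cannot_link, iters=20):
    centroids = random.sample(points, k)
    assign = {}
    for _ in range(iters):
        assign = {}
        for i, p in enumerate(points):
            # Try centroids from nearest to farthest, skipping assignments
            # that conflict with constraints on already-assigned points.
            for c in sorted(range(k), key=lambda c: dist(p, centroids[c])):
                if not violates(i, c, assign, must_link, cannot_link):
                    assign[i] = c
                    break
        for c in range(k):  # recompute each centroid as its members' mean
            members = [points[i] for i in assign if assign[i] == c]
            if members:
                centroids[c] = tuple(sum(xs) / len(xs) for xs in zip(*members))
    return assign

random.seed(0)
pts = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
print(constrained_kmeans(pts, 2, must_link=[(0, 1)], cannot_link=[(1, 2)]))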
Project References
The main references are listed under Publications
and Products above.
Area Background
The World-Wide Web has become a resource comprising much of the world's knowledge, and it incorporates a progressively larger fraction each year. Yet today our ability to use the Web as an information
resource -- whether to advance science, to enhance our lives, or just to get
the best deal on a videotape -- is in a primitive state. Information on the Web
is often hard to find and may be of dubious quality. Although information is
presented in a universal HTML format, there are many fundamental differences
across sites: words have different meanings, information is structured
differently or not at all, and query interfaces vary widely. Our ultimate goal
is not to replace the Web with a new information resource, but rather to add
functionality and tools to the Web -- to transform it into the Global InfoBase.
Area References
Please see Publications and Products above.