FROM THE WEB TO THE GLOBAL INFOBASE 2003

FROM THE WEB TO THE GLOBAL INFOBASE -- FINAL REPORT

Hector Garcia-Molina (PI), Chris Manning (co-PI), Jeff Ullman (co-PI), Jennifer Widom (co-PI)
Department of Computer Science, Stanford University

Contact Information

Hector Garcia-Molina, Chris Manning, Jeff Ullman, Jennifer Widom
Dept. of Computer Science, Gates 4A
Stanford University
Stanford, CA 94305-9040
Phone: (650) 723-0872
Fax: (650) 725-2588
Email: {hector,manning,ullman,widom}@cs.stanford.edu
URLs:
http://www-db.stanford.edu/people/hector.html
http://www-nlp.stanford.edu/~manning
http://www-db.stanford.edu/~ullman
http://www-db.stanford.edu/~widom

WWW page

Our project's main Web page is at http://www-db.stanford.edu/gib/. Other sites relevant to the grant include the DB Group home page, the Infolab home page, the NLP Group home page, and the Digital Libraries project home page.

Project Award Information

Award Number: IIS-0085896
Duration: 9/1/00-8/31/03
Title: From the Web to the Global InfoBase

Keywords

World-Wide Web, information retrieval, database, natural-language processing, data mining

Project Summary

Our proposed work was driven by the vision of a Global InfoBase (GIB): a ubiquitous and universal information resource, simple to use, up to date, and comprehensive. The project consisted of four interrelated thrusts: (i) Combining Technologies: integrating technologies for information retrieval, database management, and hypertext navigation, to achieve a "universal" information model; (ii) Personalization: developing tools for personalizing information management; (iii) Semantics: using natural-language processing and structural techniques for analyzing the semantics of Web pages; and (iv) Data Mining: designing new algorithms for mining information in order to synthesize new knowledge.

Project Impact

The project funded 27 graduate students over its lifetime. Five of them now have faculty positions, at CMU (Olston), Berkeley (Klein), Duke (Babu), Georgia Tech (Cooper) and UC Irvine (Li). The others are working at major corporations (Yahoo, Google, Microsoft, IBM, etc.) or at startups.

We developed a new course on information retrieval, extraction, and advanced web technologies, co-taught with researchers from local industry.

Our WebBase repository of Web pages (jointly funded by this project and our Digital Library project) is being used for data mining by the following organizations: UC Berkeley, Columbia, U. Washington, Harvey Mudd, Università degli Studi di Milano, U. of Arizona, California Digital Library, Cornell, U. of Houston, Learning Lab Lower Saxony (L3S), France Telecom, and U. Texas.

We wrote a significant number of research papers (which can be found on the project web site and in the FastLane final report).

Overall Summary of Activities

During the project we conducted research in the main thrust areas: combining technologies, personalization, information semantics, and data mining. What follows is a brief, high-level summary; additional details can be found in the yearly project reports and in our papers.

· Combining Technologies: One of the main goals of our project was to develop a “universal” data model that combined facilities for simultaneously accessing text, relational data, and hypertext links. We developed such a model, as well as a prototype for executing and optimizing queries in that model. We also expanded the types of data and services available in our WebBase prototype, which is now in use by many researchers.

· Personalization: Our Global Information Base makes so much information accessible that we must be very careful not to exacerbate the well-documented 'information overload' problem. We addressed this problem by enabling personalized views of information sources, particularly Web sources. Our techniques allow individual users to receive search results that are personalized to their own interests. The student working on this topic graduated and went on to start a company in the personalization area (Kaltix), which was then acquired by Google.

· Semantics: We addressed problems involved in getting more meaning out of web pages and other text, beyond simply extracting the list of words that the pages contain. For example, we studied methods for recognizing entities such as people or companies, for use in indexing and extraction, in particular exploiting discriminative machine learning techniques and character-based models, which are very robust. We also studied the problem of extracting content from the hidden Web, showing how one can obtain content available through search forms.

· Data Mining: Our data mining work focused on understanding, optimally utilizing, and improving the Web. Our work extensively used our WebBase repository, where the system's highly configurable crawlers collect large numbers of Web pages and store them locally. Using the system, we tested novel algorithms, such as those for ranking, filtering, or Web linkage mapping, on this collection. In particular, we developed new schemes for computing PageRank very efficiently (PageRank lets one rank web pages by their “importance” and is a key metric for Web searching).
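
As a point of reference for the PageRank metric mentioned in the Data Mining item, the following is a minimal sketch of the textbook power-iteration computation. It is not one of the efficient schemes developed in the project; the function name, the damping value of 0.85, the iteration count, and the uniform handling of pages without out-links are all assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook power iteration (illustrative only).

    links: dict mapping each page to the list of pages it links to.
    Returns a dict mapping each page to its rank; ranks sum to 1.
    """
    # Collect every page that appears as a source or as a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Assumption: rank held by pages with no out-links is spread uniformly.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            for q in targets:
                # Each page divides its rank equally among its out-links.
                new_rank[q] += damping * rank[p] / len(targets)
        rank = new_rank
    return rank


if __name__ == "__main__":
    toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}  # hypothetical pages
    for page, score in sorted(pagerank(toy_web).items()):
        print(page, round(score, 4))
```

In this toy graph, page "b" receives only half of the rank of "a", so it ends up ranked below both "a" and "c".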

Project References

The main references are listed under Publications and Products above.

Area Background

The World-Wide Web has created a resource comprising much of the world's knowledge and is incorporating a progressively larger fraction each year. Yet today our ability to use the Web as an information resource -- whether to advance science, to enhance our lives, or just to get the best deal on a videotape -- is in a primitive state. Information on the Web is often hard to find and may be of dubious quality. Although information is presented in a universal HTML format, there are many fundamental differences across sites: words have different meanings, information is structured differently or not at all, and query interfaces vary widely. Our ultimate goal is not to replace the Web with a new information resource, but rather to add functionality and tools to the Web -- to transform it into the Global InfoBase.

Area References

Please see Publications and Products above.

Summary of Work Done in Final Period

During the final period (no-cost extension) we performed the following work. The work done earlier is summarized in our earlier reports.

Query Optimization over Web Services

Web services are becoming a standard method of sharing data and functionality among loosely coupled systems. We studied a general-purpose Web Service Management System (WSMS) that enables querying multiple web services, of the type found in a Global InfoBase, in a transparent and integrated fashion. We have considered the problem of query optimization inside a WSMS for Select-Project-Join queries spanning multiple web services. Our main result to date is an algorithm for optimally arranging the web services in a query into a pipelined execution plan that minimizes the total running time of the query. We have also developed an algorithm for determining the optimal granularity of data “chunks” to be used for each web service call. Analytical comparisons demonstrate that our algorithms can lead to significant performance improvement over more straightforward techniques.

U. Srivastava, J. Widom, K. Munagala, R. Motwani. Query Optimization over Web Services. Technical Report, October 2005. http://dbpubs.stanford.edu:8090/pub/2005-30
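
To make the plan-ordering objective in this section concrete, here is a small sketch under a simplified cost model that is an assumption of this example, not a statement of the report's algorithm: each service has a per-tuple cost and a selectivity, the plan is a linear pipeline, and the running time is taken to be governed by the most heavily loaded stage. The sketch searches all orderings exhaustively, which is only practical for a handful of services; the service names and numbers are hypothetical.

```python
from itertools import permutations


def bottleneck_cost(plan):
    """Work done per input tuple by the most heavily loaded stage.

    plan: sequence of (name, cost_per_tuple, selectivity) triples,
    in pipeline order (assumed model, for illustration only).
    """
    fraction = 1.0  # fraction of the input tuples reaching the current stage
    worst = 0.0
    for _name, cost, selectivity in plan:
        worst = max(worst, fraction * cost)
        fraction *= selectivity  # later stages see only the surviving tuples
    return worst


def best_linear_plan(services):
    """Exhaustively pick the ordering with the smallest bottleneck cost."""
    return min(permutations(services), key=bottleneck_cost)


if __name__ == "__main__":
    services = [
        ("geocode", 4.0, 0.9),        # hypothetical service
        ("credit_check", 10.0, 0.5),  # hypothetical service
        ("blacklist", 1.0, 0.2),      # hypothetical service
    ]
    plan = best_linear_plan(services)
    print([name for name, _, _ in plan], bottleneck_cost(plan))
```

Placing the cheap, highly selective service first reduces the number of tuples that reach the expensive services, which is the intuition behind arranging services carefully in a pipeline.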

Web Spam

One of the main obstacles to good searching and personalization on the Web today is web spam. This spam is content generated with the intention of misleading search engines and giving certain pages a higher rank than they deserve. While human experts can identify spam, it is too expensive to manually evaluate a large number of pages. Instead, we have developed techniques to semi-automatically separate reputable, good pages from spam. Our approach is based on first selecting a small set of seed pages to be evaluated by an expert. From these reputable seed pages, we use the link structure of the web to discover other pages that are likely to be good. Our experiments run on the World Wide Web show that we can effectively filter out spam from a significant fraction of the web, based on a good seed set of fewer than 200 sites.

Gyongyi, Zoltan; Garcia-Molina, Hector; Pedersen, Jan. Combating Web Spam with TrustRank. International Conference on Very Large Databases (VLDB), Toronto, Canada, August 29, 2004. http://dbpubs.stanford.edu/pub/2004-52
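
The idea of spreading trust from expert-approved seed pages along links can be sketched as a PageRank-style iteration whose random jumps return only to the seeds. This is a minimal illustration, not the published implementation: the function name, the damping value, the iteration count, and the toy graph are assumptions, and the seed-selection and expert-evaluation steps are not shown.

```python
def propagate_trust(links, good_seeds, damping=0.85, iterations=20):
    """Spread trust from good seed pages along out-links (illustrative).

    links: dict mapping each page to the list of pages it links to.
    good_seeds: non-empty collection of pages an expert judged reputable.
    Returns a dict mapping each page to a non-negative trust score.
    """
    pages = set(links) | set(good_seeds)
    for targets in links.values():
        pages.update(targets)

    # Jump distribution: uniform over the seeds, zero everywhere else.
    seed_share = {
        p: (1.0 / len(good_seeds) if p in good_seeds else 0.0) for p in pages
    }
    trust = dict(seed_share)
    for _ in range(iterations):
        new_trust = {p: (1.0 - damping) * seed_share[p] for p in pages}
        for p, targets in links.items():
            for q in targets:
                # A page passes a damped, equal share of its trust to each target.
                new_trust[q] += damping * trust[p] / len(targets)
        trust = new_trust
    return trust


if __name__ == "__main__":
    toy_web = {  # hypothetical pages
        "news": ["univ"],
        "univ": ["news", "blog"],
        "blog": [],
        "spam1": ["spam2"],
        "spam2": ["spam1"],
    }
    for page, score in sorted(propagate_trust(toy_web, {"news"}).items()):
        print(page, round(score, 4))
```

Pages that no trusted page links to, directly or indirectly, keep a score of zero in this sketch, which is what allows a threshold on the score to separate likely-good pages from the rest.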

Term Extraction in Biomedical Text

As part of our work in exploiting semantics for information extraction, we focused on a particular domain, biomedicine. In particular, we developed a maximum-entropy based system for identifying Named Entities (NEs) in biomedical abstracts and evaluated its performance in two biomedical Named Entity Recognition (NER) comparative evaluations, namely BioCreative and Coling BioNLP. Our system obtained an exact match f-score of 83.2% in the BioCreative evaluation and 70.1% in the BioNLP evaluation. To achieve this performance, our system uses local features, attention to correct boundary identification, innovative use of external knowledge resources including parsing and web searches, and rapid adaptation to new NE sets.

Jenny Finkel, Shipra Dingare, Huy Nguyen, Malvina Nissim, Christopher Manning, and Gail Sinclair. 2004. Exploiting Context for Biomedical Entity Recognition: From Syntax to the Web. Joint Workshop on Natural Language Processing in Biomedicine and its Applications at Coling 2004. http://nlp.stanford.edu/~manning/papers/bionlp-camera.pdf

Shipra Dingare, Jenny Finkel, Malvina Nissim, Christopher Manning, and Claire Grover. 2004. A System For Identifying Named Entities in Biomedical Text: How Results From Two Evaluations Reflect on Both the System and the Evaluations. In The 2004 BioLink meeting: Linking Literature, Information and Knowledge for Biology at ISMB 2004. Republished as Shipra Dingare, Malvina Nissim, Jenny Finkel, Christopher Manning, and Claire Grover. 2005. Comparative and Functional Genomics 6: 77-85. http://nlp.stanford.edu/~manning/papers/ismb2004.pdf
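
For readers unfamiliar with the exact-match f-score quoted in this section, the sketch below shows how such a score is commonly computed: a predicted entity counts as correct only if its boundaries and type both match a gold-standard entity exactly. The representation of an entity as a (start, end, type) tuple, the function name, and the example spans are assumptions for illustration; the official evaluation scripts may differ in detail.

```python
def exact_match_f_score(gold, predicted):
    """Harmonic mean of precision and recall over exactly matching entities.

    gold, predicted: iterables of (start, end, entity_type) tuples
    (an assumed representation, for illustration only).
    """
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)  # same boundaries and same type
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    gold = {(0, 2, "protein"), (5, 6, "DNA"), (9, 11, "protein")}
    predicted = {
        (0, 2, "protein"),
        (5, 7, "DNA"),        # boundary error: not an exact match
        (9, 11, "protein"),
        (13, 14, "cell_type"),  # spurious entity
    }
    print(exact_match_f_score(gold, predicted))
```

Because a one-token boundary error counts as both a missed entity and a spurious one, exact-match scoring rewards the attention to boundary identification described above.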

Joint Inference

We developed a general model for joint inference in correlated natural language processing tasks when fully annotated training data is not available, and applied this model to the dual tasks of word sense disambiguation and verb subcategorization frame determination. Our model uses the EM algorithm to simultaneously complete partially annotated training sets and learn a generative probabilistic model over multiple annotations. When applied to the word sense and verb subcategorization frame determination tasks, the model learns sharp joint probability distributions which correspond to linguistic intuitions about the correlations of the variables. We have shown that use of the joint model leads to error reductions over competitive independent models on these tasks.

Galen Andrew, Trond Grenager, and Christopher Manning. 2004. Verb Sense and Subcategorization: Using Joint Inference to Improve Performance on Complementary Tasks. EMNLP 2004, pp. 150-157. http://nlp.stanford.edu/~manning/papers/synsense.pdf
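
The use of EM to complete partially annotated data while learning a joint distribution can be illustrated with a deliberately reduced sketch. It assumes only two discrete annotations per example (a sense and a frame), either of which may be missing, and it estimates a single joint table over (sense, frame) pairs. The model described in this section is richer than this; the function name, the smoothing constant, and the toy labels are all hypothetical.

```python
from itertools import product


def em_joint(examples, senses, frames, iterations=50, smoothing=1e-3):
    """Estimate P(sense, frame) from partially annotated examples with EM.

    examples: list of (sense, frame) pairs; a missing annotation is None.
    Returns a dict mapping (sense, frame) to its estimated probability.
    """
    pairs = list(product(senses, frames))
    joint = {pair: 1.0 / len(pairs) for pair in pairs}  # uniform start
    for _ in range(iterations):
        # E-step: distribute each example over the pairs consistent with
        # its observed annotations, in proportion to the current joint.
        counts = {pair: smoothing for pair in pairs}
        for sense, frame in examples:
            consistent = [
                (s, f)
                for (s, f) in pairs
                if (sense is None or s == sense) and (frame is None or f == frame)
            ]
            total = sum(joint[pair] for pair in consistent)
            for pair in consistent:
                counts[pair] += joint[pair] / total
        # M-step: renormalize the expected counts into a new joint table.
        norm = sum(counts.values())
        joint = {pair: counts[pair] / norm for pair in pairs}
    return joint


if __name__ == "__main__":
    senses = ["motion", "operate"]             # hypothetical verb senses
    frames = ["intransitive", "transitive"]    # hypothetical frames
    data = (
        [("motion", "intransitive")] * 3
        + [("operate", "transitive")] * 3
        + [("motion", None)] * 4               # frame annotation missing
        + [(None, "transitive")] * 4           # sense annotation missing
    )
    for pair, prob in sorted(em_joint(data, senses, frames).items()):
        print(pair, round(prob, 3))
```

In this toy data set the fully annotated examples anchor the correlation between sense and frame, and the partially annotated examples are then assigned mostly to the pairs that the current estimate already favors.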