Report Number: CS-TR-98-1615
Institution: Stanford University, Department of Computer Science
Title: Using Machine Learning to Improve Information Access
Author: Sahami, Mehran
Date: December 1998
Abstract: We address the problem of topical information space navigation. Specifically, we combine query tools with methods for automatically creating topic taxonomies in order to organize text collections. Our system, named SONIA (Service for Organizing Networked Information Autonomously), is implemented in the Stanford Digital Libraries testbed. It employs several novel probabilistic Machine Learning methods that enable the automatic creation of dynamic topic hierarchies based on the full-text content of documents. First, to generate such topical hierarchies, we employ a novel clustering scheme that outperforms traditional methods used in both Information Retrieval and Probabilistic Reasoning. Second, we develop methods for classifying new articles into such automatically generated, or existing manually generated, hierarchies. Our method explicitly uses the hierarchical relationships between topics to improve classification accuracy. Much of this improvement derives from the fact that the classification decisions in such a hierarchy can be made by considering only the presence (or absence) of a small number of features (words) in each document. The choice of relevant words is made using a novel information-theoretic algorithm for feature selection. The algorithms used in SONIA are also general enough to have been successfully applied to data mining problems in domains other than text.