Report Number: CS-TR-98-1615
Institution: Stanford University, Department of Computer Science
Title: Using Machine Learning to Improve Information Access
Author: Sahami, Mehran
Date: December 1998
Abstract: We address the problem of topical information space
navigation. Specifically, we combine query tools with methods
for automatically creating topic taxonomies in order to
organize text collections. Our system, named SONIA (Service
for Organizing Networked Information Autonomously), is
implemented in the Stanford Digital Libraries testbed. It
employs several novel probabilistic Machine Learning methods
that enable the automatic creation of dynamic topic
hierarchies based on the full-text content of documents.
First, to generate such topical hierarchies, we employ a
novel clustering scheme that outperforms traditional methods
used in both Information Retrieval and Probabilistic
Reasoning. Furthermore, we develop methods for classifying
new articles into such automatically generated, or existing
manually generated, hierarchies. Our method explicitly uses
the hierarchical relationships between topics to improve
classification accuracy. Much of this improvement is derived
from the fact that the classification decisions in such a
hierarchy can be made by considering only the presence (or
absence) of a small number of features (words) in each
document. The choice of relevant words is made using a novel
information theoretic algorithm for feature selection. The
algorithms used in SONIA are also general enough to have been
successfully applied to data mining problems in domains other
than text.
http://i.stanford.edu/pub/cstr/reports/cs/tr/98/1615/CS-TR-98-1615.pdf