METHOD FOR EFFICIENTLY BUILDING COMPACT MODELS FOR LARGE MULTI-CLASS TEXT CLASSIFICATION

Selvaraj, Sathiya Keerthi; Pavlov, Dmitry; Gaffney, Scott J.; Mayoraz, Nicolas Eddy; Berkhin, Pavel; Krishnan, Vijay; Sellamanickam, Sundararajan
Patent: United States Patent Application 20090274376

Abstract

A method of classifying documents includes: specifying multiple documents and classes, wherein each document includes a plurality of features and each document corresponds to one of the classes; determining reduced document vectors for the classes from the documents, wherein the reduced document vectors include features that satisfy threshold conditions corresponding to the classes; determining reduced weight vectors for relating the documents to the classes by comparing combinations of the reduced weight vectors and the reduced document vectors and separating the corresponding classes; and saving one or more values for the reduced weight vectors and the classes. Specific embodiments are directed to formulations for determining the reduced weight vectors including one-versus-rest classifiers, maximum entropy classifiers, and direct multiclass Support Vector Machines.
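The per-class feature reduction described in the abstract can be sketched roughly as follows. The specific thresholding rule (a minimum within-class document frequency) and all names are illustrative assumptions for exposition, not the patent's exact formulation:

```python
from collections import Counter, defaultdict

def reduce_features(docs, labels, min_count=2):
    """Sketch: keep, for each class, only features whose within-class
    document count meets a threshold; each document's reduced vector
    retains only the features that survive for its class."""
    counts = defaultdict(Counter)
    for doc, cls in zip(docs, labels):
        counts[cls].update(set(doc))          # document frequency per class
    kept = {cls: {f for f, n in cnt.items() if n >= min_count}
            for cls, cnt in counts.items()}
    reduced = [[f for f in doc if f in kept[cls]]
               for doc, cls in zip(docs, labels)]
    return kept, reduced
```

Training then proceeds on the reduced vectors, which is what keeps the resulting model compact when the number of classes is large.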

Web Spam Detection with Anti-Trust Rank

Vijay Krishnan and Rashmi Raj.
Conference: AIRWeb 2006, the ACM SIGIR 2006 Workshop on Adversarial Information Retrieval on the Web.

Abstract

Spam pages on the web use various techniques to artificially achieve high rankings in search engine results. Human experts can do a good job of identifying spam pages and pages whose information is of dubious quality, but it is practically infeasible to use human effort for a large number of pages. Similar to the approach in [1], we propose a method of selecting a seed set of pages to be evaluated by a human. We then use the link structure of the web and the manually labeled seed set to detect other spam pages. Our experiments on the WebGraph dataset [3] show that our approach is very effective at detecting spam pages from a small seed set and achieves higher precision of spam page detection than the TrustRank algorithm, apart from detecting pages with higher PageRanks, on average.
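The core propagation step can be sketched as follows: where TrustRank pushes trust forward along out-links from good seeds, Anti-Trust Rank pushes distrust backward, so a page that links to known spam accumulates distrust. The damping factor, iteration count, and graph encoding below are illustrative assumptions, not the paper's exact formulation:

```python
def anti_trust_rank(out_links, spam_seeds, damping=0.85, iters=50):
    """Sketch of Anti-Trust Rank. out_links maps page -> list of linked pages;
    spam_seeds is the human-labeled set of spam pages."""
    nodes = set(out_links)
    for targets in out_links.values():
        nodes.update(targets)
    in_deg = {n: 0 for n in nodes}
    for targets in out_links.values():
        for t in targets:
            in_deg[t] += 1
    seed = {n: (1.0 / len(spam_seeds) if n in spam_seeds else 0.0)
            for n in nodes}
    score = dict(seed)
    for _ in range(iters):
        # a page inherits distrust from the pages it links TO,
        # split among each target's in-links (reverse of TrustRank)
        score = {p: (1 - damping) * seed[p]
                    + damping * sum(score[q] / in_deg[q]
                                    for q in out_links.get(p, ())
                                    if in_deg[q])
                 for p in nodes}
    return score
```

Pages are then ranked by distrust score, with the highest-scoring pages flagged for review as likely spam.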

An Effective Two-Stage Model for Exploiting Non-Local Dependencies in Named Entity Recognition.

Vijay Krishnan and Christopher D. Manning
Conference: COLING-ACL 2006, Sydney, Australia, July 2006.

Abstract

This paper shows that a simple two-stage approach to handling non-local dependencies in Named Entity Recognition (NER) can outperform existing approaches that handle non-local dependencies, while being much more computationally efficient. NER systems typically use sequence models for tractable inference, but this makes them unable to capture the long-distance structure present in text. We use a Conditional Random Field (CRF) based NER system with local features to make predictions, and then train another CRF which uses both local information and features extracted from the output of the first CRF. Using features capturing non-local dependencies from the same document, our approach yields a 12.6% relative error reduction on the F1 score over state-of-the-art NER systems using local information alone, compared to the 9.3% relative error reduction offered by the best systems that exploit non-local information. Our approach also makes it easy to incorporate non-local information from other documents in the test corpus, and this gives us a 13.3% error reduction over NER systems using local information alone. Additionally, our running time for inference is just the inference time of two sequential CRFs, which is much less than that of other more complicated approaches that directly model the dependencies and do approximate inference.
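The second-stage feature extraction can be sketched in isolation: given the first-stage model's token labels over a document, derive a non-local feature such as the majority label assigned to other occurrences of the same token. The feature names below are illustrative, and the CRFs themselves are elided:

```python
from collections import Counter

def non_local_features(tokens, first_pass_labels):
    """Sketch: for each token, record the label the first-stage model
    gave it locally, plus the majority label it received across the
    whole document -- the kind of non-local feature the second-stage
    model consumes alongside the usual local features."""
    by_token = {}
    for tok, lab in zip(tokens, first_pass_labels):
        by_token.setdefault(tok, Counter())[lab] += 1
    return [{"local_label": lab,
             "doc_majority_label": by_token[tok].most_common(1)[0][0]}
            for tok, lab in zip(tokens, first_pass_labels)]
```

Because both stages are ordinary sequence models, inference cost is just two sequential decoding passes, which is the efficiency claim in the abstract.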

Enhanced Answer Type Inference from Questions using Sequential Models

Vijay Krishnan, Sujatha Das and Soumen Chakrabarti
Conference: HLT/EMNLP 2005, Vancouver, B.C., Canada, October 2005.

Abstract

Question classification is an important step in factual question answering (QA) and other dialog systems. Several attempts have been made to apply statistical machine learning approaches, including Support Vector Machines (SVMs) with sophisticated features and kernels. Curiously, the payoff beyond a simple bag-of-words representation has been small. We show that most questions reveal their class through a short contiguous token subsequence, which we call the informer span. Perfect knowledge of informer spans can enhance accuracy from 79.4% to 88% using linear SVMs on standard benchmarks. In contrast, standard heuristics based on shallow pattern-matching give only a 3% improvement, showing that the notion of an informer is non-trivial. Using a novel multi-resolution encoding of the question's parse tree, we induce a Conditional Random Field (CRF) to identify informer spans with about 85% accuracy. Then we build a meta-classifier using a linear SVM on the CRF output, enhancing accuracy to 86.2%, which is better than all published numbers.

Shortcomings of Latent Models in Supervised Settings

Vijay Krishnan
Conference: ACM SIGIR 2005, Salvador, Brazil, August 2005.

Abstract

The Aspect Model [1, 2] and the Latent Dirichlet Allocation (LDA) model [3, 4] are latent generative models proposed with the objective of modeling discrete data such as text. Though it is not explicitly published (to the best of our knowledge), it is reasonably well known in the research community that the Aspect Model does not perform very well in supervised settings, and also that latent models are frequently not identifiable, i.e. their optimal parameters are not unique. In this paper, we make a much stronger claim about the pitfalls of commonly-used latent models. By constructing a small, synthetic, but by no means unrealistic corpus, we show that latent models have inherent limitations that prevent them from recovering semantically meaningful parameters from data generated from a reasonable generative distribution. In fact, our experiments with supervised classification using the Aspect Model showed that its performance was rather poor, even worse than Naive Bayes, leading us to the synthetic study. We also analyze the scenario of using tempered EM and show that it would not remedy the above shortcomings. Our analysis suggests that there is also some scope for improvement in the LDA model [3, 4]. We then use our insight into the shortcomings of these models to come up with a promising variant of LDA that does not suffer from the aforesaid drawbacks. This could potentially lead to much better performance and model fit in the supervised scenario.

On Addressing Efficiency Concerns in Privacy-Preserving Mining.

Shipra Agrawal, Vijay Krishnan and Jayant Haritsa.
Conference: 9th Intl. Conf. on Database Systems for Advanced Applications (DASFAA), Jeju Island, Korea, March 2004.

Abstract

Data mining services require accurate input data for their results to be meaningful, but privacy concerns may influence users to provide spurious information. To encourage users to provide correct inputs, we recently proposed a data distortion scheme for association rule mining that simultaneously provides both privacy to the user and accuracy in the mining results. However, mining the distorted database can be orders of magnitude more time-consuming than mining the original database. In this paper, we address this issue and demonstrate that by (a) generalizing the distortion process to perform symbol-specific distortion, (b) appropriately choosing the distortion parameters, and (c) applying a variety of optimizations in the reconstruction process, runtime efficiencies that are well within an order of magnitude of undistorted mining can be achieved.
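Symbol-specific distortion, item (a) above, generalizes a single global flip probability to one retention probability per item. A minimal sketch of the distortion side (the per-item probabilities and names are illustrative; the paper's reconstruction optimizations are not shown):

```python
import random

def distort(transactions, keep_prob):
    """Sketch: randomized-response distortion over a fixed item universe.
    Each item's presence bit is kept with a symbol-specific probability
    keep_prob[item] and flipped otherwise, so sensitive items can be
    distorted more heavily than innocuous ones."""
    universe = sorted(keep_prob)
    distorted = []
    for txn in transactions:
        row = set()
        for item in universe:
            bit = item in txn
            if random.random() >= keep_prob[item]:
                bit = not bit          # flip with prob 1 - keep_prob[item]
            if bit:
                row.add(item)
        distorted.append(row)
    return distorted
```

The miner then estimates true itemset supports from the distorted counts using the known per-symbol probabilities, which is where the reconstruction cost, and the paper's optimizations, come in.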

Text search enhanced with types and entities.

Soumen Chakrabarti, Sujatha Das, Vijay Krishnan, Kriti Puniyani.
Book: In Text Mining: Classification, Clustering, and Applications, Srivastava and Sahami, eds., 2008.

Abstract

Giving a broad perspective of the field from numerous vantage points, Text Mining: Classification, Clustering, and Applications focuses on statistical methods for text mining and analysis. It examines methods to automatically cluster and classify text documents and applies these methods in a variety of areas, including adaptive information filtering, information distillation, and text search. The book begins with chapters on the classification of documents into predefined categories. It presents state-of-the-art algorithms and their use in practice. The next chapters describe novel methods for clustering documents into groups that are not predefined. These methods seek to automatically determine topical structures that may exist in a document corpus. The book concludes by discussing various text mining applications that have significant implications for future research and industrial use. There is no doubt that text mining will continue to play a critical role in the development of future information systems and advances in research will be instrumental to their success. This book captures the technical depth and immense practical potential of text mining, guiding readers to a sound appreciation of this burgeoning field.

Comparison of Automated and Human Assignment of MeSH Terms on Publicly-Available Molecular Datasets.

David J. Ruau, Michael Mbagwu, Joel T. Dudley, Vijay Krishnan, Atul J. Butte.
Journal: Journal of Biomedical Informatics, 2011, Elsevier.

Abstract

Publicly available molecular datasets can be used for independent verification or investigative repurposing, but this depends on the presence, consistency, and quality of descriptive annotations. Annotation and indexing of molecular datasets using well-defined controlled vocabularies or ontologies enables accurate and systematic data discovery, yet the majority of molecular datasets available through public data repositories lack such annotations. A number of automated annotation methods have been developed; however, few systematic evaluations of the quality of annotations supplied by these methods have been performed using annotations from standing public data repositories. Here, we compared manually-assigned Medical Subject Heading (MeSH) annotations associated with experiments by data submitters in the PRoteomics IDEntification (PRIDE) proteomics data repository to automated MeSH annotations derived through the National Center for Biomedical Ontology Annotator and National Library of Medicine MetaMap programs. These programs were applied to free-text annotations for experiments in PRIDE. As many submitted datasets were referenced in publications, we used the manually curated MeSH annotations of those linked publications in MEDLINE as the "gold standard". Annotator and MetaMap exhibited recall performance 3-fold greater than that of the manual annotations. We connected PRIDE experiments in a network topology according to shared MeSH annotations and found 373 distinct clusters, many of which were found to be biologically coherent by network analysis. The results of this study suggest that both Annotator and MetaMap are capable of annotating public molecular datasets with a quality comparable to, and often exceeding, that of the actual data submitters, highlighting a continuing need to improve and apply automated methods to molecular datasets in public data repositories to maximize their value and utility.