From hector Fri Dec 11 17:19:28 1992
Return-Path: <hector>
Received:  by Coke.Stanford.EDU (5.57/25-DB-eef) id AA02970; Fri, 11 Dec 92 17:19:27 -0800
Date: Fri, 11 Dec 92 17:19:27 -0800
From: Hector Garcia-Molina <hector@coke.stanford.edu>
Message-Id: <9212120119.AA02970@Coke.Stanford.EDU>
To: tsimmis-heads@db.stanford.edu
Subject: my conclusions
Cc: hector
Status: RO

Today we met to discuss the classifier/extractor part of our
project. I think the meeting went very well and it gave us a good
idea of what work can be done in this area.
Just for the record, I think I "volunteered" our IBM colleagues :-)
to try to get us the following info/data (or a subset):

(1) The code for the classifier

(2) A set of files to be analyzed. We obviously have many files
here at Stanford that we could analyze, but we may not have
all 30 types handy. It would be convenient to have sort of
a benchmark suite of files to be analyzed.

(3) The regular expressions (~200 of them) that are used to
produce the vector of initial properties.

(4) A file (hopefully large) of sample vectors.
Each line of the file will contain the type of the source file
(e.g., latex, RFC844...) and the 200 bit vector of properties.
We can use this data to try out new clustering algorithms,
or good nearness functions.
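For concreteness, here is a tiny sketch (in Python) of reading such a file. The exact format is my assumption -- one sample per line, file type and bit string separated by whitespace -- so treat this as illustrative only:

```python
def parse_vectors(lines):
    """Yield (file_type, bit_tuple) pairs from lines of the
    assumed form '<type> <bit string>', e.g. 'latex 0110...'."""
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        file_type, bits = parts
        yield file_type, tuple(int(b) for b in bits)
```

With data in hand like this, trying out a new clustering algorithm is just a matter of feeding it the bit tuples.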

(5) A reference list of good papers to read, including papers
on Information Retrieval with relevance feedback and clustering,
data mining, techniques for efficient regular expression matching, etc.
We will continue to add references to our list...

Item (1) above may be difficult to obtain, and that is why item (4)
would be very useful. If we do get (1), then (4) may be redundant
(although convenient).

What follows is more of my opinion, so please feel free to disagree.
Anyway, it seems to me that the following may be good problems
for us (Stanford) to start thinking about:

(a) How to do regular expression matching efficiently,
maybe using approximate techniques and parallel checking,
focusing on scale issues (lots of expressions, lots of files)
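One cheap trick worth trying for (a): combine the ~200 expressions into a single alternation of named groups, so one pass over the file sets all the property bits at once. A sketch (the pattern names and expressions below are made up for illustration, not the real Rufus set):

```python
import re

# Hypothetical property patterns; the real classifier has ~200.
PATTERNS = {
    "latex_cmd": r"\\(?:documentstyle|begin|section)",
    "mail_hdr":  r"^(?:From|Received|Subject):",
    "c_include": r"^#include\s*[<\"]",
}

# One combined expression: each branch is a named group.
COMBINED = re.compile(
    "|".join("(?P<%s>%s)" % (name, pat) for name, pat in PATTERNS.items()),
    re.MULTILINE,
)

def property_bits(text):
    """Scan text once; return dict of pattern name -> 1 if matched, else 0."""
    bits = dict.fromkeys(PATTERNS, 0)
    for m in COMBINED.finditer(text):
        bits[m.lastgroup] = 1  # lastgroup names the branch that matched
    return bits
```

Whether one big automaton actually beats 200 small ones at this scale is exactly the kind of question we could measure.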

(b) How to cluster vectors, how to identify new vectors,
how to select most important bits for clustering/matching...
We need to relate this to current work in data mining...
Again, the focus can be on "scale" and on trying approximate techniques, say.
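As one candidate nearness function for (b): plain Hamming distance on the property bit vectors, with classification by nearest labeled sample. This is just a starting point to compare other functions against, not a claim about what Rufus does:

```python
def hamming(u, v):
    """Nearness as Hamming distance: count of differing bits."""
    return sum(a != b for a, b in zip(u, v))

def nearest_type(vector, samples):
    """Classify a property vector by its closest labeled sample.
    `samples` is a list of (file_type, bit_tuple) pairs."""
    best_type, _ = min(samples, key=lambda s: hamming(vector, s[1]))
    return best_type
```

Selecting the "most important" bits then amounts to finding a subset of positions under which this distance still separates the types well.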

(c) Studying other alternatives to the rufus/vector approach, e.g., CART,
neural networks. There may be software available (Arthur says)
that we may test and compare against Rufus...

There were a lot of other ideas presented, but my feeling was that they were
more closely tied to the Rufus prototype, so they may be
best addressed at Almaden, e.g., by one of our students spending a summer
at Almaden...

Anyway, I hope I am not being too pushy asking for all
the items above, but I think they could really help us
focus work in this area. Also, if you have comments on my above
"conclusions" please let me know...

Since this may be my last message to tsimmis-heads for 1992,
(I will be out Dec 20 - Jan 3), let me wish everyone a great
Christmas holiday! And don't forget our Jan 8 meeting! 

hector

