CS345A, Winter 2007-8: Data Mining.
Course Information
Instructors: Anand Rajaraman (anand @ kosmix dt com),
Jeffrey D. Ullman (ullman @ gmail dt com).
TA: Babak Pahlavan
Meeting: MW 4:15 - 5:30PM; Room: Math corner basement 380-380X.
Office Hours:
Anand Rajaraman: MW 5:30-6:30PM (after class, in the same room).
Jeff Ullman: 2-4PM on the days he teaches, in 433 Gates.
Babak Pahlavan (TA): Wednesdays 9:30AM-12:30PM, in Gates Room 24B.
Prerequisites: CS145 or equivalent.
Materials: There is no textbook, but students will use the
Gradiance automated homework system, for which a nominal fee will be charged.
Notes and/or slides will be posted on-line.
We will also distribute some notes that will become part of the next
edition of Database Systems: The Complete Book (Garcia-Molina, Ullman,
Widom).
You can see earlier versions of the notes and slides covering Data Mining.
Not all of these topics will be covered this year.
Requirements: There will be periodic homeworks (some on-line, using
the Gradiance system), a final exam,
and a project on web-mining, using the Stanford WebBase.
The homework will count for about 20% of the grade, just enough to
encourage you to do it.
The project and final exam will account for the rest of the credit, in
roughly equal proportions.
Handouts
Date | Topic | PowerPoint Slides | PDF Document |
1/9 | Introductory Remarks (JDU) | PPT | PDF |
1/9 | Introductory Remarks (AR) | PPT | PDF |
1/14 | Association Rules I (JDU) | PPT | PDF |
1/14-16 | Association Rules II (JDU) | PPT | PDF |
1/16-23 | Map-Reduce (AR) | PPT | PDF |
1/23-28 | PageRank (AR) | PPT | PDF |
1/28 | HITS and Spam (AR) | PPT | PDF |
2/4 | Shingling, Minhashing (JDU) | PPT | PDF |
2/6 | Locality-Sensitive Hashing (JDU) | PPT | PDF |
2/11 | Recommendation Systems (AR) | PPT | PDF |
2/13 | Clustering I (JDU) | PPT | PDF |
2/20 | Clustering II (JDU) | PPT | PDF |
2/25 | Relation Extraction (AR) | PPT | PDF |
2/27 | Advertising (AR) | PPT | PDF |
3/3 | Stream Mining I (JDU) | PPT | PDF |
3/5 | Stream Mining II (JDU) | PPT | PDF |
Assignments
Some of the homework will be on the Gradiance system.
You should go there to open your account and enter the class code that will
be announced in class.
You can try the work as many times as you like, and we hope everyone will
eventually get 100%. The secret is that each of the questions involves a
"long-answer" problem, which you should work out in full. Each time you open
an assignment, the Gradiance system presents a random selection of right and
wrong answers, and thus samples your knowledge of the full problem. While
there are ways to game the system, we group several questions at a time, so
it is hard to get 100% without actually working the problems. Also note that
you must wait 10 minutes between openings, so brute-force random guessing
will not work.
Solutions appear after the problem set is due. However, you must submit at
least once; your most recent submission then appears with the solutions
embedded.
Project
CS345A Project specification:
- Overview: a software project that discovers or leverages interesting relationships within a
significant amount of data. Ideally, the project builds on what we have learned in class.
- Some project ideas (these are only suggestions and should by no means restrict your
imagination):
  - Implement an anti-spam algorithm (e.g., TrustRank) on a collection of web pages
  - Implement a better version of topic-sensitive PageRank on a collection of web pages (by
  "better," we mean "incorporating your own ideas")
  - Implement a collaborative filtering technique on basket/item data (from eBay or Amazon,
  for instance)
  - Implement key components of a vertical search engine (e.g., a crawler)
  - Implement a meta-search engine that post-processes Google's and Yahoo's results
  - Implement a meta-search engine that queries multiple databases, then merges and presents
  their results in a meaningful way
  - Implement a heuristic/algorithm for ranking Stanford domain web pages
  - Design and implement better 1-pass algorithms for identifying frequent itemsets
  - Design and implement your own alternative to the locality-sensitive hashing algorithm
  - Take a shot at the $1,000,000 Netflix Prize. See this NY Times article
  (local_cache) for more info.
Resources: Stanford WebBase project. Find a description here. Find out how to access web
pages in the repository here.