CS345 Autumn, 2005: Data Mining.

Course Info | Handouts | Assignments | Project | Course Outline | Resources and Reading | Frequently Asked Questions


Course Information

Instructors: Anand Rajaraman (anand @ cs dt stanford dt edu), Jeffrey D. Ullman (ullman @ gmail dt com).

TA: Robbie Yan (xyan @ stanford dt edu).

Email Address for Questions: cs345a-aut0506-staff @ lists dt stanford dt edu (This is the best way to reach all three of us simultaneously)

Meeting: MW 3:15 - 4:30PM; Room: Education 128. Would you believe our room got changed again? This one is right across the mall from the place we met the first time. We hope it will be the last change.

Office Hours: Instructors will be available after classes that they teach. Jeff is in 433 Gates and Anand in 438 Gates. Robbie's office hours: 2-5PM Tuesdays in B26B Gates.

Prerequisites: CS145 or equivalent.

Materials: There is no text, but students will use the Gradiance automated homework system for which a nominal fee will be charged. Notes and/or slides will be posted on-line. You can see earlier versions of the notes and slides covering Data Mining. Not all these topics will be covered this year.

Requirements: There will be periodic homeworks (some on-line, using the Gradiance system), a final exam, and a project on web-mining, using the Stanford WebBase. The homework will count just enough to encourage you to do it, about 20%. The project and final will account for the bulk of the credit, in roughly equal proportions.


Handouts

Note: The slides labeled as "for Anand's lecture" are authored by Anand Rajaraman.

  1. Introduction to Data Mining Powerpoint Slides for Jeff's half of 9/26 lecture. [PDF]

  2. Introduction to Web Mining Powerpoint Slides for Anand's half of 9/26 lecture and part of 9/28 lecture. [PDF]

  3. Mining Frequent Pairs; A-Priori Powerpoint Slides for Jeff's 9/28 lecture. [PDF]

  4. Improved Methods for Frequent-Pair Mining Powerpoint Slides for Jeff's 10/3 lecture. [PDF]

  5. Web Crawling Powerpoint slides for Anand's 10/5 lecture. [PDF]

  6. PageRank and Hubs/Authorities Powerpoint Slides for Jeff's 10/10 and 10/12 lectures. [PDF]

  7. Minhashing and Locality-Sensitive Hashing Powerpoint Slides for Jeff's 10/12 and 10/17 lectures. [PDF]

  8. Topic-Specific PageRank Powerpoint Slides for Anand's 10/19 lecture. [PDF]

  9. Web Spam Powerpoint Slides for Anand's 10/24 lecture. [PDF]

  10. Introduction to Stream Mining Powerpoint Slides for Jeff's 10/26 lecture. [PDF]

  11. Stream Mining, Estimating Frequencies Powerpoint Slides - Set #1 and Powerpoint Slides - Set #2 for Jeff's 10/31 and 11/2 lectures. [PDF] [PDF]

  12. Extracting Relational Data from the Web Powerpoint Slides for Anand's 11/07 lecture. [PDF]

  13. Virtual Databases Powerpoint Slides - Set #1 and Powerpoint Slides - Set #2 for Anand's 11/09 lecture. [PDF] [PDF]

  14. Introduction to Clustering, k-Means Powerpoint Slides for Jeff's 11/14 and 11/16 lectures. [PDF]

  15. More on Clustering Powerpoint Slides for Jeff's 11/28 lecture. [PDF]

  16. Optimizing Selection of Ads Powerpoint Slides for Anand's 11/30 lecture. [PDF]


Assignments

Some of the homework will be on the Gradiance system. You should go there to open your account, and enter the class code that will be told to you in class. You can try the work as many times as you like, and we hope everyone will eventually get 100%. The secret is that each of the questions involves a "long-answer" problem, which you should work. The Gradiance system gives you random right and wrong answers each time you open it, and thus samples your knowledge of the full problem. While there are ways to game the system, we group several questions at a time, so it is hard to get 100% without actually working the problems. Also notice that you have to wait 10 minutes between openings, so brute-force random guessing will not work.

Project

CS345A Project specification:

Course Outline

Some of the topics to be covered in the data-mining section are:

Data Mining

Web Mining

Here is a Tentative Schedule for the course. It will be updated periodically.


Resources and Readings


Frequently Asked Questions

  1. For any given homework, which score on Gradiance counts? The best one or the last one?

    The last one. After the homework closes, you will be allowed to view your last submission with solutions. Also, if you wish, you can then play with the assignment again, e.g. to practice for an exam, but the score will not count. Note, however, that if you don't submit at least once, you don't get to see solutions.

  2. Why should we believe that, just because no itemset in the negative border is frequent in the full dataset, there is no other itemset that is not frequent in the sample, yet is frequent in the full dataset?

    The proof is not trivial. It can be found on Toivonen's paper. You can prove it yourself by a careful induction on the size of an itemset S that either:

          1. S is frequent in the sample, or
          2. S contains a member of the negative border.

    (note that 2 includes the case where S *is* a member of the negative border.)

    We will leave the proof to you. But once you have that, suppose S is frequent in the full dataset but not in the sample. Then (2) holds, and by monotonicity, there is a member of the negative border that is frequent in the full dataset.