References for introductory lecture
- Andrei Broder et al., "Graph Structure in the Web". WWW9 conference, 2000.
- Chris Anderson, "The Long Tail". Wired magazine, October 2004.
- Sergey Brin and Larry Page. The anatomy of a large scale hypertextual web search engine. WWW7, 1998.
- Lada A Adamic. "Zipf, Power-laws, and Pareto - a ranking tutorial."
- Lada A. Adamic and Bernardo A. Huberman. "Zipf's
law and the Internet." Glottometrics 3, 2002, 143-150.
Web Crawling
- Junghoo Cho, Hector Garcia-Molina, Lawrence Page. "Efficient
Crawling Through URL Ordering." Computer Networks and ISDN
Systems, 30(1-7):161-172, 1998.
- Junghoo Cho, Hector Garcia-Molina.
"Effective page refresh policies for Web crawlers."
ACM Transactions on Database Systems, 28(4), December 2003.
- Ka Cheung Sia, Junghoo Cho.
"Efficient Monitoring Algorithm for Fast News Alert."
Technical report, UCLA, 2005.
- M. Najork and J. L. Wiener.
"Breadth-First Crawling Yields High-Quality
Pages." In Proceedings of the 10th International World Wide Web Conference,
pages 114-118, Hong Kong, May 2001.
- Andrei Z. Broder, Steven C. Glassman, Mark S. Manasse, Geoffrey Zweig.
"Syntactic Clustering of the Web."
WWW6, 1997.
PageRank, Hubs and Authorities
- Sergey Brin and Larry Page. The anatomy of a
large scale hypertextual web search engine. WWW7, 1998.
- J. Kleinberg. Authoritative
sources in a hyperlinked environment. Proc. 9th ACM-SIAM Symposium on Discrete Algorithms,
1998. Extended version in Journal of the ACM 46(1999).
- Taher Haveliwala. Efficient Computation of PageRank. Technical Report, Stanford University, 1999.
- Taher Haveliwala.
Topic-Sensitive PageRank. Proceedings of WWW11, 2002.
- Glen Jeh and Jennifer Widom. Scaling Personalized Web
Search. Proceedings of WWW12, 2003.
Web Spam
- Zoltán Gyöngyi, Hector Garcia-Molina.
Web Spam Taxonomy.
First International Workshop on Adversarial Information Retrieval on the
Web (at the 14th
International World Wide Web Conference), Chiba, Japan, 2005.
- Zoltán Gyöngyi, Hector Garcia-Molina and Jan Pedersen.
Combating Web Spam with TrustRank.
30th International Conference on Very Large Data Bases (VLDB),
Toronto, Canada, 2004.
- Zoltán Gyöngyi, Pavel Berkhin, Hector Garcia-Molina, Jan Pedersen.
Link Spam Detection Based on Mass Estimation.
32nd International Conference on Very Large Data Bases (VLDB), Seoul, Korea, 2006.
Recommendation Systems
- G. Adomavicius and A. Tuzhilin. Towards the Next
Generation of Recommender Systems: A Survey of the State-of-the-Art
and Possible Extensions. IEEE TKDE, June 2005.
- Greg Linden, Brent Smith, and Jeremy York. Amazon.com
Recommendations: Item-to-Item Collaborative Filtering. IEEE
Internet Computing, Jan/Feb 2003.
- Sean McNee, John Riedl, Joseph A. Konstan. Being accurate is
not enough: How accuracy metrics have hurt recommender systems.
ACM CHI 2006.
- Moses Charikar. Similarity Estimation Techniques from
Rounding Algorithms. ACM STOC 2002.
- Monika Henzinger. Finding Near-Duplicate Web Pages: A
Large-Scale Evaluation of Algorithms. ACM SIGIR 2006.
Relation Extraction
- Sergey Brin. Extracting Patterns
and Relations from the
World Wide Web. WebDB Workshop at 6th International Conference on
Extending Database Technology, EDBT'98, 1998.
- Eugene Agichtein and Luis Gravano.
Snowball: Extracting Relations from Large Plain-Text Collections.
Proceedings of the Fifth ACM International Conference on Digital
Libraries, 2000.
- S. Dumais, M. Banko, E. Brill, J. Lin and A. Ng.
Web question answering: Is more always better? In Proceedings of SIGIR'02, Aug 2002,
pp. 291-298.
Virtual Databases
- Nicholas Kushmerick, Daniel S. Weld, Robert Doorenbos.
Wrapper Induction for Information Extraction.
Intl. Joint Conference on Artificial Intelligence (IJCAI), 1997.
- Anand Rajaraman, Jeffrey D. Ullman.
Querying Websites using Compact Skeletons.
Journal of Computer and System Sciences 66(4): 809-851 (2003).