Web search reading list

The following resources have been useful to me while learning about Web search (2007-2009). E-mail me with suggested additions.

The math

Information retrieval and indexing

  • Managing Gigabytes by Ian H. Witten, Alistair Moffat, and Timothy C. Bell. A classic IR book; it’s ten years old. Goes into greater detail on index compression than any other IR book I’ve found. (My friends still make fun of me for reading a book called “Managing Gigabytes,” though. Let’s hope they don’t ever read this page…)
  • Introduction to Information Retrieval (available free online) by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Has great broad overviews of many topics and good diagrams and examples for teaching the concepts, and includes useful annotated “References and further reading” sections at the end of each chapter.
  • Building a Distributed Full-Text Index for the Web, a paper on the design of an index spread across many servers, a topic that I feel hasn’t gotten nearly enough attention.
  • Xapian, a C++ indexer with extremely good performance and a great Python API.
  • ThruDB, a C++ “indexing and document storage service.”
  • Tokyo Cabinet, an indexing suite written in C.
  • You’ll probably want to use Berkeley DB for index storage at some point.
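
To give a flavor of the index compression that these books cover, here is a minimal sketch of one common scheme: store a postings list as gaps between sorted document IDs, then pack each gap with variable-byte encoding (7 data bits per byte, with the high bit set on the last byte of each number). The function names are my own and are not taken from any book or library listed above, and real indexers use more elaborate layouts.

```python
def to_gaps(doc_ids):
    """Turn a sorted list of document IDs into the differences between neighbours."""
    prev, gaps = 0, []
    for doc_id in doc_ids:
        gaps.append(doc_id - prev)
        prev = doc_id
    return gaps


def from_gaps(gaps):
    """Inverse of to_gaps: running sum of the gaps."""
    total, doc_ids = 0, []
    for gap in gaps:
        total += gap
        doc_ids.append(total)
    return doc_ids


def vbyte_encode(numbers):
    """Variable-byte encode non-negative integers, most significant group first."""
    out = bytearray()
    for n in numbers:
        groups = [n & 0x7F]          # lowest 7 bits
        n >>= 7
        while n:
            groups.append(n & 0x7F)  # next 7 bits
            n >>= 7
        groups[0] |= 0x80            # high bit marks the final byte of this number
        out.extend(reversed(groups))
    return bytes(out)


def vbyte_decode(data):
    """Decode bytes produced by vbyte_encode back into a list of integers."""
    numbers, n = [], 0
    for byte in data:
        if byte & 0x80:              # final byte of the current number
            numbers.append((n << 7) | (byte & 0x7F))
            n = 0
        else:
            n = (n << 7) | byte
    return numbers


def compress_postings(doc_ids):
    return vbyte_encode(to_gaps(doc_ids))


def decompress_postings(data):
    return from_gaps(vbyte_decode(data))
```

For example, `compress_postings([824, 829, 215406])` packs the three IDs into six bytes, and `decompress_postings` recovers the original list. Small gaps, which dominate for frequent terms, cost a single byte each.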

Databases

Distributed systems

The conferences listed above (VLDB, SOSP, and OSDI) all have lots of database-related papers, too.

Distributed computing (and MapReduce)

PageRank and other ranking schemes

Web crawling

  • Efficient crawling through URL ordering. Covers how to prioritize crawling the most relevant Web pages.
  • Stanford WebBase crawls the Web and provides free access to huge up-to-date dumps of Web pages to use as test data. You can choose from several kinds of crawls, choose how many pages you want, and filter by site, etc. It completely removes the need for you to create your own crawler. I’ve bolded its name here because it is so useful.
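
The idea of crawling in priority order can be sketched with a small priority queue. This is only an illustration under my own assumptions: the score here is the number of backlinks seen so far (one of several possible ordering metrics), a dictionary stands in for actually fetching pages, and `crawl_order` is a hypothetical name, not code from the paper or from WebBase.

```python
import heapq
from collections import defaultdict


def crawl_order(seed, links, limit):
    """Return URLs in visit order, always visiting the discovered URL
    that currently has the most known backlinks.

    links maps each URL to the URLs it points at (a stand-in for fetching).
    """
    backlinks = defaultdict(int)
    done = set()
    visited = []
    counter = 0                      # tie-breaker: earlier discoveries first
    heap = [(0, counter, seed)]      # entries are (-score, counter, url)
    while heap and len(visited) < limit:
        neg_score, _, url = heapq.heappop(heap)
        if url in done or -neg_score != backlinks[url]:
            continue                 # stale entry: already visited or score outdated
        done.add(url)
        visited.append(url)
        for target in links.get(url, []):
            if target in done:
                continue
            backlinks[target] += 1
            counter += 1
            # Push a fresh entry; older entries for target become stale.
            heapq.heappush(heap, (-backlinks[target], counter, target))
    return visited


toy_web = {
    "seed": ["a", "b", "c"],
    "a": ["c"],
    "b": [],
    "c": [],
}
order = crawl_order("seed", toy_web, limit=10)
```

In this toy graph, page `c` is visited before `b` even though `b` was discovered first, because `c` has picked up a second backlink by the time the crawler chooses its next page.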

Systems

Other lists
