The following resources have been useful to me while learning about Web search (2007-2009). E-mail me with suggested additions.
- A Course on the Web Graph by Anthony Bonato. Heavy on theory that you’re unlikely to find elsewhere.
- Google’s PageRank and Beyond: The Science of Search Engine Rankings by Amy N. Langville and Carl D. Meyer. More numerical than the above book; includes some Matlab code.
- Numerical Linear Algebra by Lloyd N. Trefethen and David Bau. Teaches basic numerical linear algebra methods extremely well.
- A First Course on Numerical Methods, by Uri M. Ascher and Chen Greif. Goes into greater depth than the Trefethen book.
- Matrix Computations by Gene H. Golub and Charles F. Van Loan. A good reference.
- Modeling the Internet and the Web: Probabilistic Methods and Algorithms by Pierre Baldi, Paolo Frasconi, and Padhraic Smyth. Covers a lot of interesting topics relating to ranking algorithms that you won’t find anywhere else in print.
- Matlab. You’ll want to get good at it.
Information retrieval and indexing
- Managing Gigabytes by Ian H. Witten, Alistair Moffat, and Timothy C. Bell. A classic IR book; it’s ten years old. Goes into greater detail on index compression than any other IR book I’ve found. (My friends still make fun of me for reading a book called “Managing Gigabytes,” though. Let’s hope they don’t ever read this page…)
- Introduction to Information Retrieval (available free online) by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Has great broad overviews of many topics and good diagrams and examples for teaching the concepts, and includes useful annotated “References and further reading” sections at the end of each chapter.
- Building a Distributed Full-Text Index for the Web, a paper that discusses the design of an index spread across many servers, which is a topic that I feel hasn’t gotten nearly enough attention.
- Xapian, a C++ indexer with extremely good performance and a great Python API.
- ThruDB, a C++ “indexing and document storage service.”
- Tokyo Cabinet, an indexing suite written in C.
- You’ll probably want to use Berkeley DB for index storage at some point.
- Past papers from the Very Large Data Base (VLDB) conference. Click on the links to past years’ archives and scan the paper titles for ones that look interesting. Highlights include the Yahoo! PNUTS paper (which is also covered at Paper Trail)
- Past papers from the ACM Symposium on Operating Systems Principles (SOSP). Particularly Amazon’s Dynamo paper (Paper Trail coverage) and Google’s GFS (Google File System) paper.
- Past papers from the USENIX Symposium on Operating Systems Design and Implementation. Particularly Google’s Bigtable paper (also at Paper Trail).
- Open-source key-value databases: Scalaris (Erlang), LinkedIn’s Project Voldemort (Java), etc.
- Open-source Bigtable-style databases: HBase (Java), Hypertable (C++), Facebook’s Cassandra (Java), etc.
The conferences listed above (VLDB, SOSP, and OSDI) all have lots of database-related papers, too.
- Paper Trail has a good series on consensus protocols: two-phase commit, three-phase commit, Paxos, and a Paxos implementation.
- Google’s Chubby lock service paper.
- Paxos Made Simple, an optimistically titled paper from the inventor of Paxos, Leslie Lamport. Also, read the bizarre history of Paxos.
- Google’s “Paxos Made Live” paper has a fascinating discussion of the challenges of implementing Paxos. They even developed a terse language that compiles to C++ for constructing the Paxos state machine—just so all the code would fit on one screen.
- Consistent hashing, a good explanation by Tom White. There’s the original paper on consistent hashing, too.
Distributed computing (and MapReduce)
- Google’s MapReduce paper.
- A basic introduction to MapReduce by Michael Nielsen.
- Disco, a Python-Erlang MapReduce framework. Very easy to develop for (it’s just Python), but a bit hard to install and deploy still.
- Apache Hadoop. The most popular open-source distributed computing platform, but much more complicated than Disco. It’s written in Java.
PageRank and other ranking schemes
- The original PageRank paper, The PageRank Citation Ranking: Bringing Order to the Web.
- The HITS paper, Authoritative Sources in a Hyperlinked Environment. If you’re unfamiliar, check out the Wikipedia article on HITS.
- Two books mentioned above (under “The math”): A Course on the Web Graph by Anthony Bonato, and Google’s PageRank and Beyond: The Science of Search Engine Rankings by Amy N. Langville and Carl D. Meyer.
- A Survey of Eigenvector Methods of Web Information Retrieval by the authors of the above PageRank book. This paper covers PageRank and also HITS and SALSA, two lesser-known ranking schemes that you should know.
- A basic introduction to PageRank, simple Python PageRank code, and using MapReduce to compute PageRank, all by Michael Nielsen.
- Combating Web Spam with TrustRank, the application of some of the PageRank ideas to the fight against spam.
- BlockRank: Exploiting the Block Structure of the Web for Computing PageRank, a paper that introduces a method for speeding up PageRank calculation by considering entire hostnames as a “blocks” to calculate estimated PageRank values for each page, and then feeding the resulting PageRank vector into the standard algorithm. This technique speeds up convergence significantly.
- Mentioned above: Modeling the Internet and the Web: Probabilistic Methods and Algorithms by Pierre Baldi, Paolo Frasconi, and Padhraic Smyth, covers a lot of interesting topics relating to ranking algorithms that you won’t find anywhere else in print.
- Efficient crawling through URL ordering. Covers how to prioritize crawling the most relevant Web pages.
- Stanford WebBase crawls the Web and provides free access to huge up-to-date dumps of Web pages to use as test data. You can choose from several kinds of crawls, choose how many pages you want, and filter by site, etc. It completely removes the need for you to create your own crawler. I’ve bolded its name here because it is so useful.
- Web Search for a Planet: The Google Cluster Architecture, a casual (non-academic) piece about Google’s server specs and cluster design.
- $130 USD Seagate 1.5 TB hard drive on Amazon.