Tuesday, September 16, 2008

Reading about Hadoop

Tom White's book on Hadoop is up on Rough Cuts. If you aren't familiar with Apache Hadoop, it's Doug Cutting's effort to go beyond Lucene and build an open source implementation of much of the rest of Google's infrastructure.

There is a two-day Hadoop Camp at the upcoming ApacheCon in New Orleans. Learning about Hadoop is a great way to become familiar with some of the innovations Google has put forward in the last few years, and to see the technology behind Yahoo!'s large production clusters. What follows is the beginning of Mr. White's book; I am looking forward to the chapter on HBase. After the excerpt I've added two short code sketches, one for the filesystem API and one for a simple MapReduce job.

"Hadoop was created by Doug Cutting, the creator of Lucene, the widely-used text search library. Hadoop has its origins in Nutch, an open source web search engine, itself a part of the Lucene project.

Building a web search engine from scratch was an ambitious goal, for not only is the software required to crawl and index websites complex to write, it is also a challenge to run without a dedicated operations team, since there are lots of moving parts. It's expensive, too: Cutting and Cafarella estimated that a system supporting a 1-billion-page index would cost around half a million dollars in hardware, with a $30,000 monthly running cost. Nevertheless, they believed it was a worthy goal, as it would open up and ultimately democratise search engine algorithms.

Nutch was started in 2002, and a working crawler and search system quickly emerged. However, they realized that their architecture wouldn't scale to the billions of pages on the web. Help was at hand with the publication of "The Google File System" (GFS) [ref] in 2003, which described the architecture of the distributed filesystem Google was running in production. GFS, or something like it, would solve their storage needs for the very large files generated as part of the web crawl and indexing process. In particular, GFS would free up time being spent on administrative tasks such as managing storage nodes. In 2004 they set about writing an open source implementation, the Nutch Distributed File System (NDFS), as it came to be known.

In 2004 Google published the paper introducing MapReduce, and by mid-2005 the Nutch developers had a working MapReduce implementation and had ported the major Nutch algorithms to run using MapReduce and NDFS. NDFS and the MapReduce implementation in Nutch were applicable beyond the realm of search, and in February 2006 they moved out of Nutch to form an independent subproject of Lucene called Hadoop. Yahoo! hired Doug Cutting and, with a dedicated team, provided the resources to turn Hadoop into a system that ran at web scale. This was demonstrated in February 2008 when Yahoo! announced that its production search index was being generated by a 10,000-core Hadoop cluster.

Earlier, in January 2008, Hadoop was made a top-level project at Apache, confirming its success and dynamic community."
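Since the excerpt stops before any code appears, here are my own quick sketches of what programming against Hadoop looks like. Treat both as illustrative only: they follow the current Java APIs (the org.apache.hadoop.fs and org.apache.hadoop.mapred packages), and the class names and file paths are invented for the example.

First, the filesystem side. HDFS, the descendant of the NDFS described above, is accessed through the FileSystem abstract class. A minimal sketch that copies a local file into the cluster and streams it back out:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsRoundTrip {
  public static void main(String[] args) throws Exception {
    // Reads cluster settings from the Hadoop configuration files on
    // the classpath; with no configuration this falls back to the
    // local filesystem, which is handy for trying things out.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Hypothetical paths, used only for illustration.
    Path local = new Path("crawl-segment.txt");
    Path remote = new Path("/data/crawl-segment.txt");

    // Copy the local file into the distributed filesystem,
    // then stream it back to stdout.
    fs.copyFromLocalFile(local, remote);
    FSDataInputStream in = fs.open(remote);
    try {
      IOUtils.copyBytes(in, System.out, 4096, false);
    } finally {
      in.close();
    }
  }
}

The nice part is that the same code runs against a single machine or a thousand-node cluster; only the configuration changes.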
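Second, the MapReduce side: the canonical word-count job, with a mapper that emits a (word, 1) pair for every token and a reducer that sums the counts per word. This mirrors the introductory example in the Hadoop documentation:

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCount {

  // Mapper: for each line of input, emit (word, 1) per token.
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        output.collect(word, one);
      }
    }
  }

  // Reducer: sum the counts collected for each word.
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(Map.class);
    // The reducer doubles as a combiner to pre-aggregate map output.
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}

You run it with something like: hadoop jar wordcount.jar WordCount input-dir output-dir. The appeal of the model is everything you don't write: the framework splits the input, schedules map and reduce tasks across the cluster, and re-runs tasks on machines that fail.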
