Hadoop Gets Corporate Backing for Google-like Open Source Technology

The New York Times has a good write-up on Cloudera, a company formed to produce a commercially supported distribution of Hadoop, the open source project that replicates key distributed computing technologies used internally at Google (see: Hadoop, a Free Software Program, Finds Uses Beyond Search).

Cloudera, a company made up of Hadoop project contributors, is betting that businesses with large-scale data analysis problems to solve will be in the market for a supported distribution of Hadoop, along with training and implementation guidance, in the same way that open source technologies like MySQL and Red Hat Linux have companies standing behind them.

While Google has kept its internal implementation private, it has published a series of academic papers describing the computer science behind the design of its distributed file system and the MapReduce programming technique, which it applies to many ultra-large-scale data processing tasks, such as indexing the web, by spreading the work across large numbers of cheap computers. That was enough to allow Doug Cutting, a leading figure in several Apache open source projects, to create an independent implementation following the same general design. That's what became the Hadoop project. Google gave its tacit blessing, even after Cutting joined Yahoo!, which up to this point has been the major corporate backer of the Hadoop project. One reason is that Google would like knowledge of its distributed computing techniques to spread to people it might like to hire and companies it might like to buy. Google and IBM even got together to sponsor academic training in distributed computing using Hadoop as the practice environment.

I wrote about the Yahoo-Hadoop connection a couple of years ago for Baseline Magazine (Yahoo Challenge to Google Has Roots in Open Source). At the time, I think Hadoop was something Yahoo was using mostly for back room data analysis of log files and such. Yahoo has since developed production applications that run on Hadoop, as have some other prominent Internet businesses, such as Facebook. Facebook in turn has contributed code back to a sub-project called Hive, aimed at building data warehousing applications using a variant of SQL.

To interact with Hadoop at a more native level requires writing programs according to the MapReduce method. But although it may not be as familiar to large numbers of developers as SQL, what's got people excited about MapReduce is that it's much simpler than many other approaches to distributed computing. That is, you don't have to spend a lot of time deciding how a task will be broken down and what steps will be executed in parallel. Instead, you define two steps: the map (the preliminary search and index task) and the reduce (where the preliminary results are sorted and combined). The classic example, derived from search, is counting how often a word appears on the web. You parcel out your web crawl data to thousands of servers and dispatch many map jobs to count every occurrence of every word in each subset of the data and return a series of key-value pairs, where the key is the word ("the") and the value is the count (500). When all the map servers return their results, you can then sort by the keys and add the values to get the total count for each word. Several MapReduce tasks can be chained together to handle more complicated tasks.
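
To make the word-count example concrete, here is a minimal sketch in plain Python. It does not use Hadoop or its API; the function names (map_word_count, reduce_word_count) and the two sample strings are made up for illustration, and everything runs in one process rather than across thousands of servers.

```python
from collections import defaultdict


def map_word_count(chunk):
    """Map step: count every word in one subset of the data.

    Returns a list of (word, count) key-value pairs.
    """
    counts = defaultdict(int)
    for word in chunk.lower().split():
        counts[word] += 1
    return list(counts.items())


def reduce_word_count(mapped_results):
    """Reduce step: add up the values for each key, then sort by key."""
    totals = defaultdict(int)
    for pairs in mapped_results:
        for word, count in pairs:
            totals[word] += count
    return dict(sorted(totals.items()))


# Hypothetical stand-ins for subsets of web crawl data.
chunks = ["the cat saw the dog", "the dog ran"]

# In a real cluster each map call would run on a different machine.
mapped = [map_word_count(chunk) for chunk in chunks]
totals = reduce_word_count(mapped)
print(totals)
```

The only things the programmer writes are the two functions; how the chunks are distributed and how the intermediate pairs are gathered is left to the runtime.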

The Hadoop runtime system takes charge of farming out the map tasks across tens, hundreds, or thousands of computers, tracking them so it knows when one of those computers dies before returning its results, restarting processes as necessary, and then bringing the intermediate results together to be reduced and produce an answer. For a more detailed overview, check out this presentation by Parand "Tony" Darugar (a former Director of Architecture at Yahoo, who was there in November when this talk was given).
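
The bookkeeping described above can be illustrated with a toy scheduler. This is a sketch under loud assumptions, not how Hadoop is implemented: threads on one machine stand in for cluster nodes, a raised exception stands in for a computer dying, and run_map_tasks, flaky_count, and max_attempts are hypothetical names invented for this example.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_map_tasks(map_fn, chunks, max_attempts=3, workers=4):
    """Run map_fn over every chunk in parallel, restarting tasks that fail."""
    results = {}
    attempts = {i: 0 for i in range(len(chunks))}
    pending = set(attempts)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while pending:
            # Farm out every task that has not yet produced a result.
            futures = {}
            for i in pending:
                attempts[i] += 1
                futures[pool.submit(map_fn, chunks[i])] = i
            pending = set()
            # Track the tasks and notice which ones failed.
            for future in as_completed(futures):
                i = futures[future]
                try:
                    results[i] = future.result()
                except Exception:
                    if attempts[i] >= max_attempts:
                        raise
                    pending.add(i)  # schedule a restart
    # Bring the intermediate results together in input order.
    return [results[i] for i in range(len(chunks))]


failed_once = set()


def flaky_count(chunk):
    """Hypothetical map task that fails the first time it sees one chunk."""
    if chunk == "the dog ran" and chunk not in failed_once:
        failed_once.add(chunk)
        raise RuntimeError("simulated worker failure")
    return len(chunk.split())


intermediate = run_map_tasks(flaky_count, ["the cat saw the dog", "the dog ran"])
total = sum(intermediate)  # a trivial reduce step: add up the word counts
print(intermediate, total)
```

The caller never sees the simulated failure; the task is simply run again, which is the property that makes it practical to use large numbers of cheap, unreliable machines.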

MapReduce does require learning a new style of data processing if you want to get the most out of the system. But there are also ways to simplify the challenge, such as using the MapReduce plug-in for Eclipse produced by IBM, or Pig, another Hadoop sub-project that has created a high-level language for writing these programs with a simpler programming model.

I'm curious whether readers of CIOZone have experimented with Hadoop or, better yet, put it into production to address their own challenges.





Copyright © 2007-2012 CIOZones. All Rights Reserved. CIOZone is a property of PSN, Inc.