Bringing Hadoop to the Enterprise

For Cloudera, the exciting part of bringing Hadoop to the masses lies in going beyond the web companies that are already embracing it and capturing the attention of biotech, financial services, oil and gas, and other industries. In particular, the company is looking for organizations that face what it calls "Big Data" problems - some sort of analysis that conventional database technology can't address as efficiently, if at all.


I learned more about that in a conversation with Christophe Bisciglia, a former Google employee who was instrumental in introducing Hadoop to the academic community as a tool for teaching Google-style distributed computing techniques. Although Hadoop runs on an independently developed open source codebase, it mimics the style of distributed computing used within Google closely enough that the company recognized that it could be useful for training potential future employees. In partnership with the National Science Foundation, Google also hosts a Hadoop cluster in its own data center that's available for research use.


I'm writing this as a follow-up to a previous post about Cloudera, a company providing commercial support to the open source Hadoop project for Google-style distributed computing, and I won't repeat the details I explained there about where it came from and how it works. Short version: Hadoop is a system for batch-oriented processing of large amounts of data. The data is spread over a distributed storage system that can span thousands of computers, and the processing of that data is distributed across all those same computers. The distributed programming model is known as MapReduce.


Big web operations like Yahoo! and Facebook have adopted Hadoop and contributed to the open source project, which in turn is allowing smaller web companies to slipstream along behind them.


The best of those Web 2.0 companies are so chock full of engineering talent that they can figure out how to configure, deploy, and leverage the technology on their own, Bisciglia says. What Cloudera is focusing on is making it a little easier for the rest of us to get on board. "Cloudera is focused on issues related to enterprise deployment. Not everyone has resources of Google or Yahoo, so we need to ensure that Hadoop is easy to deploy and manage," he says.


The company's main product is a commercially supported distribution of the technology, packaged into RPM files so it can easily be loaded onto Red Hat Linux using the Red Hat Package Manager (with support for other Linux distributions to follow). Other simple steps toward making the technology more manageable include providing web-based configuration wizards and integration with existing Linux administration tools for starting and stopping services. The open source business model is close to that of Red Hat Inc., meaning that the code is freely available from Cloudera and other sources but businesses are encouraged to purchase a support contract.


Cloudera also provides on-site training courses for organizations implementing Hadoop, but so far its fees are relatively low and the courses are aimed more at promoting the technology than at bringing in significant revenue, Bisciglia says. "It quickly became [apparent] that the demand for training far exceeded our ability to capitalize on that demand," he says. So the company has also created videos of its training sessions, which it's giving away on its website.


Although examples of web data analysis tasks for which Hadoop is ideal are easiest to come by, the data crunching prowess of technologies like Hadoop will also prompt other industries to tackle data analysis challenges that were beyond their reach previously. For example, the biotech industry is starting to think in terms of not just sequencing the human genome but providing genetic analysis for specific individuals - to the point where drugs could be bioengineered to have the best effects and fewest side effects for specific individuals. Providing that kind of analysis within reasonable time and cost parameters is a Big Data problem where Hadoop could be part of the answer, Bisciglia says. Similarly, Hadoop could factor into producing better computer models of financial industry data or of seismic data about the ocean floor gathered by oil and gas exploration companies.


If I can get one of the organizations exploring these possibilities to talk, I'll let you know. Cloudera obviously has some faith that they're out there, or will be soon, because if not the company won't have much of a business.


How difficult is it to learn MapReduce? Bisciglia says he doubts it's as big a jump as, say, teaching a Cobol programmer to work with Java and object-oriented programming. The programming model is most similar to that of functional languages such as Lisp, he says. The functions that make up a MapReduce routine are little programs in their own right, distributed across all the nodes of a Hadoop cluster, and they can be written in any language - Java, C, Perl, whatever. They just have to follow a few core design principles. "If you adhere to that model and those semantics, then you can parallelize your software over as many machines as you need."
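
To give a feel for that model, here is a minimal sketch in Python of a word count, the usual first MapReduce exercise. It is not Hadoop code: run_mapreduce is a hypothetical single-process stand-in for the framework, and map_words and reduce_counts are names invented for illustration. The point is only that the map and reduce steps are small, side-effect-free functions, which is what would let a real framework run many copies of them on different machines.

```python
from collections import defaultdict
from itertools import chain


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def reduce_counts(word, counts):
    """Reduce step: combine every value that was emitted for one key."""
    return word, sum(counts)


def run_mapreduce(records, mapper, reducer):
    """Hypothetical stand-in for the framework: map, group by key, reduce.

    A real cluster would run the mapper and reducer calls on many nodes;
    here everything happens in one process so the data flow is visible.
    """
    groups = defaultdict(list)
    for key, value in chain.from_iterable(mapper(r) for r in records):
        groups[key].append(value)
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


lines = ["big data big cluster", "data moves to the cluster"]
word_counts = run_mapreduce(lines, map_words, reduce_counts)
print(word_counts)
```

Because neither function depends on anything except its own arguments, the grouping step in the middle is the only place where data from different inputs has to meet.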


Typically, several MapReduce functions will be chained together to produce a given result. "You can think of individual MapReduce jobs as being like transformations," Bisciglia says. For starters, you might sort through some huge volume of website logs, break them up by session, and aggregate them by user. Then you might perform an analysis of the user database, and join the two data sources together to produce a given report. So your report on website usage patterns by category of user might involve 4 to 6 MapReduce transformations.
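
A rough sketch of that kind of chain, again in plain Python rather than Hadoop, might look like the following. Everything here is an assumption made for illustration: the run_job helper, the 30-minute session gap, the record layouts, and the sample data. It uses three jobs rather than the four to six mentioned above, simply to keep the example short.

```python
from collections import defaultdict

SESSION_GAP = 30 * 60  # assumption: 30 idle minutes ends a session


def run_job(records, mapper, reducer):
    """Hypothetical single-process stand-in for one MapReduce job."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output


# Job 1: break the raw log up by session and count sessions per user.
def map_log(record):
    user, timestamp = record
    yield user, timestamp


def reduce_sessions(user, timestamps):
    ordered = sorted(timestamps)
    sessions = 1
    for previous, current in zip(ordered, ordered[1:]):
        if current - previous > SESSION_GAP:
            sessions += 1
    yield user, sessions


# Job 2: join the per-user counts with the user table on the user key.
def map_tagged(record):
    source, user, value = record
    yield user, (source, value)


def reduce_join(user, tagged):
    values = dict(tagged)  # e.g. {"sessions": 2, "category": "guest"}
    if "sessions" in values and "category" in values:
        yield values["category"], values["sessions"]


# Job 3: aggregate sessions by category of user for the final report.
def map_identity(record):
    yield record


def reduce_sum(category, counts):
    yield category, sum(counts)


logs = [("alice", 0), ("alice", 600), ("alice", 7200),
        ("bob", 100), ("carol", 50), ("carol", 5000)]
users = [("alice", "subscriber"), ("bob", "guest"), ("carol", "subscriber")]

per_user = run_job(logs, map_log, reduce_sessions)
tagged = ([("sessions", user, n) for user, n in per_user]
          + [("category", user, c) for user, c in users])
joined = run_job(tagged, map_tagged, reduce_join)
report = dict(run_job(joined, map_identity, reduce_sum))
print(report)
```

Each job's output is just the next job's input, which is why it helps to think of the jobs as transformations.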


Learning the basics is not that hard, although it takes time and experience to become a proficient MapReduce programmer, Bisciglia says. One thing that helps in the transition is a Hadoop sub-project called HIVE that allows developers to construct queries using a version of SQL, the Structured Query Language of relational databases.
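
To illustrate the appeal, the per-user aggregation that takes a mapper and a reducer to write by hand collapses into a single GROUP BY statement in SQL. The sketch below runs such a query with Python's built-in sqlite3 module purely as a stand-in, since the point is the shape of the query rather than the engine. The page_views table and its columns are made up for the example, and the exact SQL dialect that HIVE accepts may differ.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for the SQL layer.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (user_id TEXT, url TEXT)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?)",
    [("alice", "/home"), ("bob", "/home"), ("alice", "/pricing")],
)

# One declarative statement replaces a hand-written map step (emit user_id)
# and reduce step (count the rows for each user_id).
views_per_user = conn.execute(
    "SELECT user_id, COUNT(*) AS views "
    "FROM page_views GROUP BY user_id ORDER BY user_id"
).fetchall()
conn.close()

print(views_per_user)
```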


"It's true people don't want to learn new tricks. HIVE provides a bridge to all the existing models and technologies that analysts are familiar with. So people who haven't learned MapReduce yet can work with SQL," he says, and that will get them through until "they get to questions [that] can't be answered with SQL."


Yet the Big Data distributed environment is significantly different from traditional programming in ways that newcomers sometimes fail to appreciate. If each MapReduce routine includes a request to a database or file server or web server, for example, then you could have hundreds or thousands of machines in the Hadoop cluster all making the same request to the same server at the same time. The right way to address that problem would be to first dump the database or other source into the Hadoop distributed file system so that each node gets a subset of the data to work on and can process it locally, without overloading the network or a centralized data server.


Otherwise, you wind up with something that looks like a hacker's distributed denial of service attack meant to crash the server at the other end of all those requests. "A lot of people forget, or don't realize, how much power they're wielding," Bisciglia says.
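
The difference is easy to see in a toy simulation. The Python sketch below is not Hadoop code, and every name in it (CentralDatabase, partition, the two counting functions, the 50-node figure) is hypothetical. It counts users by category twice: once with every record triggering a lookup against a shared server, and once after a single bulk dump whose rows are split among the nodes, roughly the way a distributed file system would spread them.

```python
from collections import Counter


class CentralDatabase:
    """Hypothetical shared server that counts the requests it receives."""

    def __init__(self, rows):
        self.rows = dict(rows)
        self.requests = 0

    def lookup(self, user_id):
        self.requests += 1  # one network round trip per record
        return self.rows[user_id]

    def dump(self):
        self.requests += 1  # a single bulk export
        return list(self.rows.items())


def partition(records, num_nodes):
    """Split records into one chunk per node."""
    return [records[i::num_nodes] for i in range(num_nodes)]


def count_with_remote_lookups(user_ids, db, num_nodes):
    """Anti-pattern: every map call on every node queries the same server."""
    totals = Counter()
    for node_ids in partition(user_ids, num_nodes):
        for user_id in node_ids:
            totals[db.lookup(user_id)] += 1
    return totals


def count_from_local_copy(db, num_nodes):
    """Dump once, then let each node work only on its own subset."""
    rows = db.dump()
    totals = Counter()
    for node_rows in partition(rows, num_nodes):
        totals.update(category for _, category in node_rows)
    return totals


users = [(f"user{i}", "guest" if i % 4 == 0 else "subscriber")
         for i in range(1000)]
user_ids = [user_id for user_id, _ in users]

naive_db = CentralDatabase(users)
naive_totals = count_with_remote_lookups(user_ids, naive_db, num_nodes=50)

bulk_db = CentralDatabase(users)
local_totals = count_from_local_copy(bulk_db, num_nodes=50)

print(naive_db.requests, bulk_db.requests)  # 1000 1
print(naive_totals == local_totals)         # True
```

Both approaches produce the same answer; only the load on the shared server changes, and in the first version that load grows with the size of the data rather than staying constant.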









Copyright © 2007-2012 CIOZones. All Rights Reserved. CIOZone is a property of PSN, Inc.