Monday 18 August 2014

BIG DATA MEETS HADOOP

Citing “human capital” as an intangible but crucial element of their success, most organizations
will say that their employees are their most valuable asset. Another critical asset, one that rarely
appears on a corporate balance sheet, is the information a company holds. The value of an
organization’s information depends on its trustworthiness, its volume, its accessibility, and the
organization’s ability to make sense of it all quickly enough to support intelligent decision making.
It is very difficult to comprehend the sheer amount of digital information that organizations
produce. IBM states that 90 percent of the digital data in the world was created in the past two
years alone. Organizations are collecting, producing, and storing this data, which can be a strategic
resource. A book written more than a decade ago, The Semantic Web: A Guide to the Future of
XML, Web Services, and Knowledge Management by Michael Daconta, Leo Obrst, and Kevin T.
Smith (Indianapolis: Wiley, 2004) included a maxim that said, “The organization that has the best
information, knows how to find it, and can utilize it the quickest wins.”
Knowledge is power. The problem is that the vast amount of digital information being collected
has outpaced traditional database tools, which cannot manage or process it quickly enough. As a
result, organizations have been drowning in data: unable to use it well, and unable to “connect the
dots” fast enough to extract the knowledge that the data contains.
The term “Big Data” describes data sets so large that traditional means of storing, managing,
searching, analyzing, and otherwise processing them become a challenge. Big Data is characterized
by the sheer magnitude of digital information, arriving from many sources and in many formats
(structured and unstructured), that can be processed and analyzed to uncover the insights and
patterns needed to make informed decisions.



What are the challenges with Big Data? How can you store, process, and analyze such a large
amount of data to identify patterns and knowledge from a massive sea of information?
Analyzing Big Data requires massive storage and computations that demand a great deal of
processing power. As digital information grew over the past decade, organizations tried different
approaches to solving these problems. At first, the focus was on giving individual machines more
storage, processing power, and memory, only to find that analytical techniques on single machines
failed to scale. Over time, many realized the promise of distributed systems (distributing tasks over
multiple machines), but the resulting data analytics solutions were often complicated, error-prone,
or simply not fast enough.
In 2002, while developing Nutch (a search engine project focused on crawling, indexing, and
searching Internet web pages), Doug Cutting and Mike Cafarella were struggling to process the
vast amount of information involved. Facing Nutch’s storage and processing demands, they knew
that they would need a reliable, distributed computing approach that could scale to the volume of
website data the tool would be collecting.
Over the next two years, Google published its papers on the Google File System (GFS) and
MapReduce, a programming model and distributed execution framework for processing large data
sets across clusters of machines. Recognizing the promise of these approaches to distributed
processing and storage, Cutting and Cafarella used this work as the basis of the distributed platform
for Nutch, resulting in what we now know as the Hadoop Distributed File System (HDFS) and
Hadoop’s implementation of MapReduce.
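
To make the programming model concrete, the sketch below shows the canonical word-count
example written against Hadoop’s Java MapReduce API (org.apache.hadoop.mapreduce): the map
phase emits a (word, 1) pair for every word it sees, the framework groups the pairs by word, and
the reduce phase sums the counts. Class names and input/output paths here are illustrative; a real
job would be packaged as a JAR and submitted to a cluster.

    // Minimal word-count job using Hadoop's MapReduce API.
    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Map phase: split each input line into words and emit (word, 1) for each one.
      public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // Reduce phase: the framework groups all counts for the same word; the reducer sums them.
      public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation on each mapper node
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory (typically in HDFS)
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory (typically in HDFS)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

The same pattern scales from a single test file to terabytes spread across an HDFS cluster, because
the framework, not the programmer, handles data partitioning, scheduling, and fault tolerance.
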
In 2006, after struggling with the same “Big Data” challenges of indexing massive amounts of
information for its own search engine, and after watching the progress of the Nutch project, Yahoo!
hired Doug Cutting and quickly decided to adopt Hadoop as the distributed framework for solving
those challenges. Yahoo! spun out the storage and processing parts of Nutch to form
Hadoop as an open source Apache project, and the Nutch web crawler remained its own separate
project. Shortly thereafter, Yahoo! began rolling out Hadoop as a means to power analytics for
various production applications. The platform was so effective that Yahoo! merged its search and
advertising into one unit to better leverage Hadoop technology.
In the past 10 years, Hadoop has evolved from its search engine–related origins to one of the
most popular general-purpose computing platforms for solving Big Data challenges. It is quickly
becoming the foundation for the next generation of data-based applications. The market research
firm IDC predicts that Hadoop will drive a Big Data market expected to exceed $23 billion by 2016.
Since the launch of the first Hadoop-centered company, Cloudera, in 2008,
dozens of Hadoop-based startups have attracted hundreds of millions of dollars in venture capital
investment. Simply put, organizations have found that Hadoop offers a proven approach to Big Data
analytics.
