Monday 18 August 2014

Big Data and the Hadoop Ecosystem

Everyone says it: we are living in the era of “Big Data.” Chances are that you have heard this
phrase. In today’s technology-fueled world, computing power has significantly increased,
electronic devices are more commonplace, accessibility to the Internet has improved, and users
have been able to transmit and collect more data than ever before.
Organizations are producing data at an astounding rate. It is reported that Facebook alone
collects 250 terabytes a day. According to Thomson Reuters News Analytics, digital data
production is projected to grow from almost 1 million petabytes (equal to about 1 billion
terabytes) in 2009 to 7.9 zettabytes (a zettabyte is equal to 1 million petabytes) in
2015, and to an estimated 35 zettabytes in 2020. Other research organizations offer even higher
estimates!
As organizations have begun to collect and produce massive amounts of data, they have
recognized the advantages of data analysis, but they have also struggled to manage the
information they hold. This has led to new challenges. How can you
effectively store such a massive quantity of data? How can you effectively process it? How
can you analyze your data in an efficient manner? Knowing that data will only increase,
how can you build a solution that will scale?

These challenges that come with Big Data are not just for academic researchers and data scientists.
In a Google+ conversation a few years ago, noted computer book publisher Tim O’Reilly made
a point of quoting Alistair Croll, who said that “companies that have massive amounts of data
without massive amounts of clue are going to be displaced by startups that have less data but more
clue …” In short, Croll was saying that unless your business understands the data it has, it
will not be able to compete with businesses that do.
Businesses realize that analyzing Big Data can yield tremendous benefits in business
competition, situational awareness, productivity, science, and innovation. Because competition
is driving the analysis of Big Data, most organizations agree with O’Reilly and Croll. These
organizations believe that the survival of today’s companies will depend on their capability to store,
process, and analyze massive amounts of information, and to master the Big Data challenges.
If you are reading this book, you are most likely familiar with these challenges, you have some
familiarity with Apache Hadoop, and you know that Hadoop can be used to solve these problems.
This chapter explains the promises and the challenges of Big Data. It also provides a high-level
overview of Hadoop and its ecosystem of software components that can be used together to build
scalable, distributed data analytics solutions.
