So what is Hadoop?
Only recently has this previously obscure platform become immensely popular.
In a nutshell, Hadoop is an open source platform used for the
processing and storage of Big Data applications. Written in Java, the core
distribution of Hadoop is fairly simple and not that hard to learn.
Hadoop has achieved major buzzword status, and to many people, Hadoop
means Big Data. Since Hadoop is at the center of many Big Data
applications, there are many jobs available to developers who know it.
Apache Hadoop is a software framework that consists of components
and other tools for analytics and large-scale data warehousing applications.
It is based on the concept of divide and conquer, using distributed
processing for computation and storage across different nodes in a cluster.
Hadoop is heavily used by Yahoo, Facebook, and many other Fortune 500
companies. Originally created in 2005, Hadoop has become immensely popular
and synonymous with Big Data.
Hadoop is an open source top-level Apache project. Like other major
open source projects such as Linux, Hadoop is ever evolving, as a global
community of developers and contributors constantly adds features and
improvements.
So what does all this mean for the software developer?
There are two central components to the Hadoop framework.
First, the MapReduce framework is designed to process large volumes
of data in parallel, which requires distributing the data across multiple
nodes in the cluster.
A MapReduce task involves two steps: mapping and reducing. The
algorithm behind MapReduce is not new, but it has become popular because of
its usefulness for distributed computing over very large data sets.
First, a mapping program reads the input and performs filtering and sorting,
such as sorting employee records by last name into queues; a reduce program
then acts as an aggregation step, perhaps grouping like names and counting
them. MapReduce operates in parallel across many machines, and mappers and
reducers are unaware of each other and operate independently.

The second component is the Hadoop Distributed File System, or HDFS,
which is designed to store very large files; the default block size of an
HDFS file segment is 64 megabytes.
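As a quick arithmetic check of what that block size implies, the short sketch below computes how many 64 MB blocks a file occupies (the 1 GB file size is a hypothetical example, not taken from the text):

```python
import math

BLOCK_SIZE = 64 * 1024 * 1024   # default HDFS block size: 64 MB
file_size = 1024 * 1024 * 1024  # a hypothetical 1 GB file

# HDFS splits a file into fixed-size blocks, rounding up for the final
# partial block.
num_blocks = math.ceil(file_size / BLOCK_SIZE)
print(num_blocks)  # 16
```

Each of those 16 blocks can then be replicated and spread across different nodes in the cluster.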
HDFS is tuned for streaming reads: it assumes a sustained transfer
rate on the order of 100 megabytes per second and a disk seek time of about
10 milliseconds, so it favors large sequential reads over random access. It
makes use of two kinds of daemons, the NameNode (the master, which maintains
file metadata) and the DataNodes (which store the actual file blocks), to
manage files across the cluster.
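To make the map and reduce steps described earlier concrete, here is a minimal single-machine sketch of the classic word-count pattern in Python. This is illustrative only; production Hadoop jobs are typically written in Java against the MapReduce API, and the function names here are hypothetical.

```python
from collections import defaultdict

def mapper(line):
    # Map step: emit a (key, value) pair for each word in the line.
    for word in line.lower().split():
        yield (word, 1)

def reducer(pairs):
    # Reduce step: aggregate the values for each key independently.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["big data needs big tools", "hadoop stores big data"]
pairs = [pair for line in lines for pair in mapper(line)]
result = reducer(pairs)
print(result["big"])   # 3
print(result["data"])  # 2
```

In a real cluster, the framework runs many mappers in parallel over HDFS blocks and shuffles each key to a reducer, but the mapper/reducer contract is the same as in this sketch.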
Hadoop was originally developed from Google's work on the Google File
System and Google MapReduce. Being open source, Hadoop is free, and its
usage is covered under the Apache 2.0 license.
One of the unique design features of Hadoop is that it works on the
principle of moving computation near the data, rather than the traditional
approach of transferring data to the programs that process it.
So who is using Hadoop?
Organizations use Hadoop for many purposes; to name a few:
In mobile communication and data management, stored mobile data
can be processed and analyzed to identify customer preferences. Hadoop
can be used for predictive analytics, such as finding airline tickets or
the cheapest hotels near a location. Search engines can analyze an
individual's browsing history and produce unique results customized to
that user.
Google Earth uses Hadoop to obtain, save, and analyze satellite
pictures, which are categorized by street and by Google Earth
panoramic views. Google also uses Hadoop to store, analyze, and display
user reviews drawn from the history of other web sites. In addition, Google
leverages Hadoop to display travel suggestions and directions for
user-entered addresses, complete with traffic conditions.
E-commerce industries use Hadoop to analyze customer purchasing
history and to recommend products to customers. These customer records
are also used by retailers and brand organizations to understand
current buying trends. This information can drive new product
offerings, forecasting, pattern recognition, and deep historical data
analytics; such techniques can be used to forecast the weather,
identify patterns of fraudulent activity, and perform deep historical
analysis of customer purchasing trends. Hadoop is also used in image
processing, where satellite images can be stored and analyzed.
Amazon uses Hadoop to store the purchase history of all of its
users so as to show new buyers the purchasing frequency of items and
suggest new purchases, thus driving new sales. Amazon also stores and
analyzes the page history of all of its users, providing the feature
"what other items do customers buy after viewing this item?" as a way
of suggesting other products to customers.
PayPal uses Hadoop to ensure the security of customer data.
Purchasing history is saved and analyzed for specific behavioral
patterns for each customer, which eventually helps in identifying
odd patterns that indicate fraud and theft. This results in improved
security of customer data.
In IT infrastructure and security, Hadoop is used to gather,
analyze, identify, and understand cyber malware attack patterns in order
to improve security.
Hadoop has created whole new disciplines in the Big Data field; new
job categories are appearing in data analytics, predictive modeling,
data mining, and cloud database development.
We will soon live in a world where everything that we do, from
eating breakfast to choosing which cellphone plan to buy, will be
predicted based on the enormous amounts of data stored and processed by
platforms like Hadoop.