In this topic I will show you how Hadoop uses MapReduce. We will walk
through the MapReduce life cycle, the role of the job client, the job
tracker, and the task tracker, and the workings of the map tasks and the
reduce tasks. Understanding each of these processes and pieces is
essential in learning Hadoop.
In addition to the HDFS, MapReduce is a core part of Hadoop. Tools
such as Pig and Hive are built on the MapReduce engine. So to learn these
tools it is important to first learn MapReduce.
MapReduce jobs are written in Java. Developers with Java skills will
pick up MapReduce faster than developers who are versed in other
languages. The job client is a Java class that submits a MapReduce job.
Depending on how it is used, the class is usually instantiated in the main
method of a Java program. The job client provides methods to submit
MapReduce jobs, track their progress, interact with their component tasks,
and get Hadoop MapReduce cluster information.
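As a rough sketch of what that looks like, here is a minimal driver using the classic org.apache.hadoop.mapred API. The WordCountMapper and WordCountReducer class names are assumptions for this sketch (both are sketched later in this topic):

    import java.io.IOException;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RunningJob;

    public class WordCount {
      public static void main(String[] args) throws IOException {
        // Describe the job: its jar, mapper, reducer, and output key-value types.
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(WordCountMapper.class);   // illustrative mapper
        conf.setReducerClass(WordCountReducer.class); // illustrative reducer

        // Input and output locations on the HDFS.
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        // The job client submits the job and polls its progress until it finishes.
        RunningJob job = JobClient.runJob(conf);
      }
    }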
The steps in MapReduce job submission involve: validating the input and
output specifications of the job; computing the input splits for the job;
if needed, initializing the accounting information for the job's
distributed cache; copying the job's jar and related configuration to the
HDFS; and finally submitting the job to the job tracker and monitoring
its status.
The job tracker is a Hadoop MapReduce service that assigns MapReduce
tasks to data nodes on the Hadoop cluster. It is initiated by the job
client. The job tracker runs on the name node and contains the algorithm
that determines which data node receives which processing portion of the
job.
The job tracker communicates with the task tracker on each of the
data nodes. Each of the respective task trackers notifies the job tracker
of the health or death of its data node so that the job tracker can assign
MapReduce tasks accordingly. Another feature of the job tracker is that it
attempts to move MapReduce tasks to where the data lives.
This is one of the pillars of Hadoop: move the processing near the data
rather than the data to the processing. Task trackers run on the data
nodes and are responsible for communicating with the job tracker. The job
tracker is a single point of failure in the Hadoop MapReduce service.
If a job tracker fails, all running MapReduce jobs fail. The task tracker
monitors MapReduce processing within its own node. Task trackers are not
aware of other data nodes and their respective task trackers. Remember,
data nodes only communicate with the name node and never communicate with
each other.
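As a small illustration of this relationship, the job client can ask the job tracker for cluster information, including how many task trackers are currently live. This is a minimal sketch, assuming the cluster configuration on the classpath points at the job tracker:

    import java.io.IOException;

    import org.apache.hadoop.mapred.ClusterStatus;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class ClusterInfo {
      public static void main(String[] args) throws IOException {
        // Connects to the job tracker named in the cluster configuration.
        JobClient client = new JobClient(new JobConf());
        ClusterStatus status = client.getClusterStatus();
        // Each live task tracker reports in from its data node.
        System.out.println("Live task trackers: " + status.getTaskTrackers());
        System.out.println("Running map tasks:  " + status.getMapTasks());
      }
    }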
Let's go over the map phase, or the mapping life cycle:
1. Inputs from the HDFS are fetched into a local copy according to the required input format.
2. Key-value pairs are created.
3. The map function for the specific job is applied to the key-value pairs (a mapper sketch follows below).
4. The data is sorted, aggregated, and combined.
5. The results are saved in the local file system.
6. Once the task is done, the task tracker is informed.
7. The task tracker informs the job tracker.
8. The results are passed on to the reduce phase by the job tracker.
Now let's go over the reduce phase, or the reducing life cycle. The user
may end the job after the map phase, making the reduce phase optional.
There are three main steps in the reduce phase: copy, sort, and merge.
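To make the map function in step 3 concrete, here is a minimal mapper sketch in the classic org.apache.hadoop.mapred API. The word-count logic is an illustrative assumption, not a prescribed example:

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // A mapper receives one key-value pair at a time and emits new pairs.
    public class WordCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

      private static final IntWritable ONE = new IntWritable(1);
      private final Text word = new Text();

      public void map(LongWritable key, Text value,
                      OutputCollector<Text, IntWritable> output,
                      Reporter reporter) throws IOException {
        // With TextInputFormat, key is the line's byte offset; value is the line.
        for (String token : value.toString().split("\\s+")) {
          if (!token.isEmpty()) {
            word.set(token);
            output.collect(word, ONE); // emit (word, 1) for sorting and combining
          }
        }
      }
    }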
Understanding the MapReduce Data Flow
Now that we've covered the components that make up a MapReduce
job, let's examine how data flows through the MapReduce process. In this
part, I will show you how MapReduce handles and processes data, explore
the mapping and reducing steps in more detail, and expand the vocabulary
of the MapReduce data flow process.
Files in the HDFS are stored as replicated blocks across the data nodes.
Running a MapReduce job may involve running mapping tasks on many of the
data nodes in the Hadoop cluster. The mapping task running in the task
tracker of each data node is identical; mappers are not assigned special
tasks. In this way, each mapper is generic and can process any file. Each
mapper takes the files to be mapped, saves them locally, and
processes them. After the mapping phase is completed, the mapped
key-value pairs must be exchanged between data nodes to be sent to a
reducer. The reduce tasks are allocated across the same data nodes on the
Hadoop cluster. This exchange is the only time data nodes
communicate with each other. The mapper performs the first real step in
a MapReduce job. The mapper starts with a key-value pair. The mapper's map
method writes out key-value pairs, which are then forwarded to the
reducers. Mappers are single threaded, so each mapper runs in its own
instance known as a task. Mappers do not share mapping tasks. This task encapsulation
guarantees that one mapping task is not affected by the failure of
another. When the mapping tasks are complete, the job moves into the
reducing phase. Individual mapping tasks do not communicate with each
other, nor do they even know that the others exist. The same is true of
reduce tasks. The client never passes data from one machine to
another. All data movement is handled by the Hadoop framework through
the job tracker on the name node. In this way, since Hadoop takes
control of the entire process, the status of the job can be reliably
tracked. Hadoop is always monitoring the status of the cluster as it
relates to running MapReduce jobs. Thus, it is able to route processing
away from data nodes that have either become unreachable or troublesome.
In the reduce phase, a single instance of each reducer is instantiated
for each reduce task. The reduce phase occurs after the map phase of the
MapReduce process. The key-value pairs are assigned to a reducer. The
reducer invokes its reduce method. This method receives a key and an
iterator over the values that are associated with that key.
The iterator returns these values in an undefined order. Each reducer
also receives an OutputCollector and a Reporter object. These objects are
used in the same way as they are in the map method.
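A matching reducer sketch in the classic API shows exactly that shape: a key, an iterator over that key's values, an OutputCollector, and a Reporter. The word-count summing is again an illustrative assumption:

    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class WordCountReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

      public void reduce(Text key, Iterator<IntWritable> values,
                         OutputCollector<Text, IntWritable> output,
                         Reporter reporter) throws IOException {
        // The iterator hands back this key's values in no defined order.
        int sum = 0;
        while (values.hasNext()) {
          sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum)); // emit (word, total count)
      }
    }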
In this part, we will see how data is formatted and subdivided
before it is sent to a MapReduce job. Remember that MapReduce jobs are
written in Java.
The Hadoop framework contains several derived versions of the
InputFormat class. A few work directly with text files and vary in how
they interpret the text. The InputFormat has another important job: it
has to divide the input data into smaller pieces before the data is fed
into a mapper. These fragments, called splits, are made on the block
boundaries of the underlying HDFS. Depending on the data, some files
might actually be unsplittable.
Other sources of data, like unsequential data from a database table, do
not lend themselves to splitting, as it is hard to determine where a
record starts and ends. When accessing the data and dividing it into
splits, it is important that this is performed as efficiently as
possible, because the overall volume of data fed into a mapper is massive.
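If you know your files cannot safely be cut at block boundaries, one way to express that, sketched here against the classic API, is to override isSplitable so that each file becomes a single split handled by a single mapper:

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.TextInputFormat;

    // Treats every input file as unsplittable: each file is handed to one
    // mapper as a single split instead of being cut on HDFS block boundaries.
    public class UnsplittableTextInputFormat extends TextInputFormat {
      @Override
      protected boolean isSplitable(FileSystem fs, Path file) {
        return false;
      }
    }

A job would select it with conf.setInputFormat(UnsplittableTextInputFormat.class).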
If you like, you can create your own versions of the InputFormat class
for custom formatting of input into your MapReduce programs. For example,
the TextInputFormat class reads text files line by line and emits a key
for each record representing the byte offset.
The value is the line content up to the line feed character. If your
data spans multiple lines and perhaps uses a delimiter, you can create
your own implementation of InputFormat that parses your data into records
based on that delimiter. Basically, knowing how your data is structured,
offset, and delimited determines how you build your implementation of
InputFormat. The TextInputFormat class parses files into splits by using
byte offsets.
Next, the class reads the file line by line from the split and feeds it
into a mapper. The TextInputFormat class has an associated RecordReader
that has to allow for the fact that the splits will rarely line up nicely
with a line feed character or carriage return. The way that data is split
is important. For example, the RecordReader will always read past the
estimated end of a split to the next line feed character in order to
complete the final record.
The reader associated with the next split in the file will, in turn, skip
ahead to the start of the next full record before processing its input
fragment. All implementations of the RecordReader must have built-in
logic that ensures that records spanning split boundaries are not missed.
In a nutshell, InputFormat describes where the data originates from and
how it is presented to the mapper. If the data source is on the local
machine or the HDFS, most implementations of InputFormat will descend
from FileInputFormat.
If your data comes from other sources, an InputFormat implementation can
be written that reads data from that source. Some cloud databases have
implementations of InputFormat that read records directly from a database
table. In these systems, data is streamed from the database table
directly into each networked machine. In this case, InputFormat reads
this streamed data, parses it, and feeds it into a mapper.
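Hadoop's classic API ships one such implementation, DBInputFormat, which streams rows out of a table over JDBC. The table name, column names, connection string, and record class below are all assumptions for this sketch:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapred.lib.db.DBWritable;

    // Hypothetical record type mapping one row of an "orders" table.
    public class OrderRecord implements Writable, DBWritable {
      long id;
      String customer;

      public void readFields(ResultSet rs) throws SQLException {
        id = rs.getLong("id");
        customer = rs.getString("customer");
      }

      public void write(PreparedStatement ps) throws SQLException {
        ps.setLong(1, id);
        ps.setString(2, customer);
      }

      public void readFields(DataInput in) throws IOException {
        id = in.readLong();
        customer = in.readUTF();
      }

      public void write(DataOutput out) throws IOException {
        out.writeLong(id);
        out.writeUTF(customer);
      }
    }

A job driver would then point its input at the table rather than at HDFS paths (conf is the JobConf from the driver sketch earlier; DBInputFormat and DBConfiguration live in org.apache.hadoop.mapred.lib.db):

    conf.setInputFormat(DBInputFormat.class);
    DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver",
        "jdbc:mysql://dbhost/mydb", "user", "password");
    DBInputFormat.setInput(conf, OrderRecord.class, "orders",
        null /* conditions */, "id" /* orderBy */, "id", "customer");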