
Introduction to the MapReduce Life Cycle

By Nezha EL GOURII, published on 19/10/2016 at 16:36:02
Approved by the review committee

Introduction

In this article I will show you how Hadoop uses MapReduce: we will walk through the MapReduce life cycle, the role of the job client, the job tracker and the task tracker, and the workings of the map tasks and the reduce tasks. Understanding each of these processes and pieces is essential to learning Hadoop.

MapReduce life cycle

In addition to HDFS, MapReduce is a core part of Hadoop. Tools such as Pig and Hive are built on the MapReduce engine, so to learn those tools it is important to first understand MapReduce.

MapReduce jobs are written in Java, so developers with Java skills will pick up MapReduce faster than developers who are versed in other languages. The job client is a Java class that submits a MapReduce job; it is usually instantiated in the main method of a driver program. The job client provides methods to submit MapReduce jobs, track their progress, interact with a job's component tasks, and get information about the Hadoop MapReduce cluster.
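
As a minimal sketch, assuming the classic org.apache.hadoop.mapred API and a hypothetical word count job (the WordCountDriver, WordCountMapper, and WordCountReducer names are illustrative only), a driver might configure and submit a job like this:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    // Hypothetical driver: configures a word count job and submits it
    // through the job client.
    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(WordCountDriver.class);
            conf.setJobName("wordcount");

            // Types of the keys and values produced by the job.
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(IntWritable.class);

            // Mapper and reducer classes (sketched later in this article).
            conf.setMapperClass(WordCountMapper.class);
            conf.setReducerClass(WordCountReducer.class);

            // Input and output locations on HDFS, taken from the command line.
            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));

            // Submits the job to the job tracker and blocks until it completes,
            // reporting progress along the way.
            JobClient.runJob(conf);
        }
    }

JobClient.runJob is the blocking variant; JobClient also offers a submitJob method that returns immediately with a handle for tracking the job's progress.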

Submitting a MapReduce job involves several steps: validating the input and output specifications of the job; computing its input splits; if needed, initializing the accounting information required for the job's distributed cache; copying the job's JAR and related configuration to HDFS; and finally submitting the job to the job tracker and monitoring its status.

The job tracker is a Hadoop MapReduce service that assigns MapReduce tasks to data nodes on the Hadoop cluster. It is initiated by the job client. The job tracker runs on the name node and contains the algorithm that determines which data node receives which processing portion of the MapReduce job.

The job tracker communicates with a task tracker on each of the data nodes. Each task tracker notifies the job tracker of the health or death of its data node, so the job tracker knows where it can assign MapReduce tasks. Another feature of the job tracker is that it attempts to move MapReduce tasks to where the data lives.

This is one of the pillars of Hadoop: move the processing near the data rather than the data to the processing. Task trackers run on the data nodes and are responsible for communicating with the job tracker. The job tracker is a single point of failure for the Hadoop MapReduce service; if the job tracker fails, all running MapReduce jobs fail. Each task tracker monitors MapReduce processing within its own node. Task trackers are not aware of other data nodes or their respective task trackers. Remember that data nodes only communicate with the name node and never with each other.

Let's go over the map phase or the mapping life cycle.

  1. Inputs from HDFS are fetched into a local copy according to the required input format.

  2. Key-value pairs are created.

  3. The map function defined for the specific job is applied to each key-value pair (a sketch of such a function follows this list).

  4. The data will be sorted, aggregated, and combined locally.

  5. The results will be saved in the local file system.

  6. Once the task is done, the task tracker will be informed.

  7. The task tracker informs the job tracker.

  8. The results will be passed on to the reduce phase by the job tracker.

Now let's go over the reduce phase, or the reducing life cycle. A job may end after the map phase, making the reduce phase optional. There are three main steps in the reduce phase: copy, sort, and merge.
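
Before turning to those reduce steps, here is a minimal sketch of the map function applied in step 3, for the hypothetical word count job configured earlier, again assuming the classic org.apache.hadoop.mapred API: the input key is the byte offset of a line, the value is the line itself, and one (word, 1) pair is emitted per word.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // Hypothetical map task for the word count job.
    public class WordCountMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output,
                        Reporter reporter) throws IOException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                output.collect(word, ONE);  // emit an intermediate key-value pair
            }
        }
    }

The output.collect call produces the intermediate key-value pairs that are then sorted, combined locally, and saved in the local file system, as described in steps 4 and 5 above.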

Understanding the MapReduce Data Flow

MapReduce data flow

Now that we've covered the components that make up a MapReduce job, let's examine how data flows through the MapReduce process. In this part, I will show you how MapReduce handles and processes data, explore the mapping and reducing steps in more detail, and expand the vocabulary of the MapReduce data flow process.

Files in HDFS are stored as blocks replicated across the data nodes. Running a MapReduce job may involve running mapping tasks on many of the data nodes in the Hadoop cluster. Each mapping task running under the task tracker of a data node is identical; mappers are not assigned special tasks. In this way, each mapper is generic and can process any file. Each mapper takes the files to be mapped, saves them locally, and processes them. After the mapping phase is completed, the mapped key-value pairs must be exchanged between data nodes so that all values for a given key are sent to a single reducer.
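
Which reducer receives a given key is decided by a partition function that every mapper applies in the same way, so identical keys always end up at the same reducer no matter which data node produced them. By default Hadoop hashes the key; the sketch below illustrates that idea as a custom partitioner in the classic org.apache.hadoop.mapred API (the class name is hypothetical):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    // Sketch of hash-based partitioning: every mapper applies the same
    // function, so identical keys always land on the same reducer.
    public class WordPartitioner implements Partitioner<Text, IntWritable> {

        public void configure(JobConf job) {
            // no configuration needed for this sketch
        }

        public int getPartition(Text key, IntWritable value, int numReduceTasks) {
            // mask off the sign bit so the result is a valid partition index
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }
    }

Such a class would be registered in the driver with conf.setPartitionerClass(WordPartitioner.class); leaving it out keeps Hadoop's default hash partitioner, which behaves the same way.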

Reduce tasks are allocated across the same data nodes of the Hadoop cluster. This exchange is the only time data nodes communicate with each other. The mapper performs the first real step in a MapReduce job: it starts with a key-value pair, and its map method writes new key-value pairs that are then forwarded to the reducers. Mappers are single threaded, so each mapper runs in its own instance, known as a map task.

Mappers do not share mapping tasks. This task encapsulation guarantees that a failing or unreliable mapping task does not affect the others. When the mapping tasks are complete, the job is sent into the reducing phase. Individual mapping tasks do not communicate with each other, nor do they even know that the others exist; the same is true of reduce tasks. The client never passes data from one machine to another. All data movement is handled by the Hadoop framework through the job tracker on the name node. In this way, since Hadoop controls the entire process, the status of the job can be reliably monitored.

Hadoop is always monitoring the status of the cluster as it relates to running MapReduce jobs. Thus, it is able to route processing away from data nodes that have become unreachable or troublesome. In the reduce phase, which occurs after the map phase of the MapReduce process, a single instance of the reducer is instantiated for each reduce task. The key-value pairs are assigned to a reducer, and the reducer calls its reduce method. This method receives a key and an iterator over the values associated with that key.

The values associated with the key are returned by the iterator in an undefined order. Each reducer also receives an OutputCollector and a Reporter object, which are used in the same way as they are in the map method.
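
Continuing the hypothetical word count example, and still assuming the classic org.apache.hadoop.mapred API, a reduce method that receives a word as its key and an iterator over its counts might look like this:

    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    // Hypothetical reduce task for the word count job: for each word it sums
    // the counts supplied by the iterator and writes the total.
    public class WordCountReducer extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {

        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output,
                           Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();   // values arrive in no defined order
            }
            output.collect(key, new IntWritable(sum));
        }
    }

The OutputCollector writes the final key-value pairs, and the Reporter can be used to report progress or update counters, just as in the map method.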

Mappers and reducers

In this part, we will see how data is formatted and subdivided before it is sent to a MapReduce job. Remember that MapReduce jobs are written in Java.

The Hadoop framework contains several derived versions of the InputFormat class. A few work directly with text files and define how to interpret text. The InputFormat has another important job: it has to divide the input data into smaller pieces before the data is fed into a mapper. These fragments, called splits, are cut on the block boundaries of the underlying HDFS. Depending on the data, some files might actually be unsplittable.

Other sources of data, such as non-sequential data from a database table, do not lend themselves to splitting, as it is hard to determine where a record starts and ends. When accessing the data and dividing it into splits, it is important that this is performed as efficiently as possible, as the overall volume of data fed into the mappers is massive. If you like, you can create your own versions of the InputFormat class for custom formatting of input into your MapReduce programs. For example, the TextInputFormat class reads text files line by line and emits a key for each record representing the byte offset of the line.

The value is the line content up to the line feed character. If your data spans multiple lines and perhaps uses a delimiter, you can create your own implementation of InputFormat that parses your data into records based on that delimiter. Basically, knowing how your data is structured, offset, and delimited determines how you build your implementation of InputFormat. The TextInputFormat class parses files into splits by using byte offsets.

Next, the class reads the file line by line from the split and feeds it into a mapper. The TextInputFormat class has an associated RecordReader that has to be flexible about the fact that splits will rarely line up nicely with a line feed or carriage return character. The way that data is split is important: for example, the RecordReader will always read past the estimated end of a split, up to the line feed character that ends the record it is reading.

The next split in the file will have its own associated reader, which will scan that split and start to process that input fragment. All implementations of RecordReader must have built-in logic that ensures that records spanning split boundaries are not missed. In a nutshell, InputFormat describes where the data originates from and how it is presented to the mapper. If the data source is on the local machine or in HDFS, most implementations of InputFormat will descend from FileInputFormat.
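
As an illustration of descending from FileInputFormat, here is a minimal sketch, again assuming the classic org.apache.hadoop.mapred API, of a custom input format (the class name is hypothetical) that reuses the standard line reader but declares its files unsplittable, so each file is processed whole by a single map task:

    import java.io.IOException;

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileSplit;
    import org.apache.hadoop.mapred.InputSplit;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.LineRecordReader;
    import org.apache.hadoop.mapred.RecordReader;
    import org.apache.hadoop.mapred.Reporter;

    // Hypothetical input format: line-by-line reading, but files are never
    // split, so one map task consumes each file in its entirety.
    public class WholeFileTextInputFormat extends FileInputFormat<LongWritable, Text> {

        @Override
        protected boolean isSplitable(FileSystem fs, Path file) {
            return false;   // treat every file as unsplittable
        }

        @Override
        public RecordReader<LongWritable, Text> getRecordReader(
                InputSplit split, JobConf job, Reporter reporter) throws IOException {
            // Reuse the standard line reader over the (whole-file) split.
            return new LineRecordReader(job, (FileSplit) split);
        }
    }

In the driver, such a class would be wired in with conf.setInputFormat(WholeFileTextInputFormat.class) in place of the default TextInputFormat.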

If your data comes from other sources, an InputFormat implementation can be written to read data from that source. Some cloud databases have implementations of InputFormat that read records directly from a database table. In these systems, data is streamed from the database table directly into each networked machine; the InputFormat reads this streamed data, parses it, and feeds it into a mapper.
