
Hadoop TDG 3 – MR Job


Anatomy of a MapReduce Job Run

 

Classic MapReduce (MapReduce 1)

A job run in classic MapReduce is illustrated in Figure 6-1. At the highest level, there are four independent entities:

• The client, which submits the MapReduce job.
• The jobtracker, which coordinates the job run. The jobtracker is a Java application whose main class is JobTracker.
• The tasktrackers, which run the tasks that the job has been split into. Tasktrackers are Java applications whose main class is TaskTracker.
• The distributed filesystem (normally HDFS, covered in Chapter 3), which is used for sharing job files between the other entities.

 

Job Submission: staging the job's files in HDFS

The submit() method on Job creates an internal JobSubmitter instance and calls submitJobInternal() on it (step 1 in Figure 6-1).
Having submitted the job, waitForCompletion() polls the job’s progress once a second and reports the progress to the console if it has changed since the last report. When the job is complete, if it was successful, the job counters are displayed. Otherwise, the error that caused the job to fail is logged to the console.
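
As a minimal sketch of the client side (assuming the new org.apache.hadoop.mapreduce API; MyMapper and MyReducer are hypothetical classes), a driver typically looks like this, with waitForCompletion() doing both the submission and the once-a-second progress polling described above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MyJobDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "my job");           // Job.getInstance(conf) in later releases
        job.setJarByClass(MyJobDriver.class);        // so the job JAR can be located and shipped
        job.setMapperClass(MyMapper.class);          // hypothetical mapper class
        job.setReducerClass(MyReducer.class);        // hypothetical reducer class
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not already exist
        // waitForCompletion() submits the job, then polls its progress once a second
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}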

 

Let's look at what JobSubmitter actually does.
The job submission process implemented by JobSubmitter does the following:
• Asks the jobtracker for a new job ID (by calling getNewJobId() on JobTracker) (step 2).
• Checks the output specification of the job. For example, if the output directory has not been specified or it already exists, the job is not submitted and an error is thrown to the MapReduce program.
• Computes the input splits for the job. If the splits cannot be computed, because the input paths don't exist, for example, then the job is not submitted and an error is thrown to the MapReduce program. (The input data is divided into splits, by default one per 64 MB block.)
• Copies the resources needed to run the job, including the job JAR file, the configuration file, and the computed input splits, to the jobtracker's filesystem in a directory named after the job ID. The job JAR is copied with a high replication factor (controlled by the mapred.submit.replication property, which defaults to 10) so that there are lots of copies across the cluster for the tasktrackers to access when they run tasks for the job (step 3). The high replication factor means no single replica becomes a hot spot when many tasktrackers fetch the JAR.

• Tells the jobtracker that the job is ready for execution (by calling submitJob() on JobTracker) (step 4).

 

Job Initialization: building the task list from the splits and the configuration

When the JobTracker receives a call to its submitJob() method, it puts it into an internal queue from where the job scheduler will pick it up and initialize it.
Initialization involves creating an object to represent the job being run, which encapsulates its tasks, and bookkeeping information to keep track of the tasks’ status and progress (step 5).
In short: the jobtracker puts the submitJob() request onto an internal queue; the job scheduler picks it up, initializes it by creating an object to represent the running job, breaks the job down into tasks, and records the bookkeeping information used to track the tasks' status and progress.

 

To create the list of tasks to run, the job scheduler first retrieves the input splits computed by the client from the shared filesystem (step 6).
It then creates one map task for each split.

The number of reduce tasks to create is determined by the mapred.reduce.tasks property in the Job, which is set by the setNumReduceTasks() method, and the scheduler simply creates this number of reduce tasks to be run. Tasks are given IDs at this point.
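
Continuing the driver sketch above, the reduce count is set explicitly with the method named in the text; a value of zero makes the job map-only:

// One map task is created per input split automatically; the number of
// reduce tasks has to be chosen for the job (the mapred.reduce.tasks property).
job.setNumReduceTasks(4);
// job.setNumReduceTasks(0);   // map-only job: map output is written straight to HDFS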

In addition to the map and reduce tasks, two further tasks are created: a job setup task and a job cleanup task.
These are run by tasktrackers and are used to run code to setup the job before any map tasks run, and to cleanup after all the reduce tasks are complete.
The OutputCommitter that is configured for the job determines the code to be run, and by default this is a FileOutputCommitter.
For the job setup task it will create the final output directory for the job and the temporary working space for the task output, and for the job cleanup task it will delete the temporary working space for the task output.
The commit protocol is described in more detail in “Output Committers” on page 215.

The job scheduler first retrieves the list of input splits from HDFS and then builds the task list: one map task is created per split, while the number of reduce tasks is taken from the configuration.
There are also two further tasks, a job setup task and a job cleanup task, which do exactly what their names suggest.

 

Task Assignment

Tasktrackers run a simple loop that periodically sends heartbeat method calls to the jobtracker. Heartbeats tell the jobtracker that a tasktracker is alive, but they also double as a channel for messages. As a part of the heartbeat, a tasktracker will indicate whether it is ready to run a new task, and if it is, the jobtracker will allocate it a task, which it communicates to the tasktracker using the heartbeat return value (step 7).
Tasktrackers keep sending heartbeats to the jobtracker; besides saying "I am still alive", a heartbeat also reports the tasktracker's current state. If the tasktracker has spare capacity to run a task, the jobtracker assigns one via the heartbeat's return value.

Before it can choose a task for the tasktracker, the jobtracker must choose a job to select the task from.
There are various scheduling algorithms as explained later in this chapter (see “Job Scheduling” on page 204), but the default one simply maintains a priority list of jobs.
Having chosen a job, the jobtracker now chooses a task for the job.

Before choosing a task, the jobtracker must first choose a job. There are several scheduling algorithms for this; the default simply picks by priority.
Tasktrackers have a fixed number of slots for map tasks and for reduce tasks: for example, a tasktracker may be able to run two map tasks and two reduce tasks simultaneously.
(The precise number depends on the number of cores and the amount of memory on the tasktracker; see “Memory” on page 305.)
The default scheduler fills empty map task slots before reduce task slots, so if the tasktracker has at least one empty map task slot, the jobtracker will select a map task; otherwise, it will select a reduce task.
Each tasktracker is configured with a fixed number of map slots and reduce slots according to its capacity (cores and memory), which bounds how many map and reduce tasks it can run at once. The scheduler fills map slots first because the reduce phase cannot start until the maps have produced output, so map tasks naturally take priority.

 

To choose a reduce task, the jobtracker simply takes the next in its list of yet-to-be-run reduce tasks, since there are no data locality considerations.

For a map task, however, it takes account of the tasktracker’s network location and picks a task whose input split is as close as possible to the tasktracker. In the optimal case, the task is data-local, that is, running on the same node that the split resides on. Alternatively, the task may be rack-local: on the same rack, but not the same node, as the split. Some tasks are neither data-local nor rack-local and retrieve their data from a different rack from the one they are running on. You can tell the proportion of each type of task by looking at a job’s counters (see “Built-in Counters” on page 257).

Choosing a map task takes data locality into account: ideally the split is stored on the tasktracker itself; failing that, the jobtracker picks a task whose input split is as close as possible to the tasktracker.
Reality is rarely that ideal, and you can use the job's counters to see the proportion of each locality type and judge how well tasks were placed.

 

Task Execution

Now that the tasktracker has been assigned a task, the next step is for it to run the task.
First, it localizes the job JAR by copying it from the shared filesystem to the tasktracker's filesystem. It also copies any files needed from the distributed cache by the application to the local disk; see "Distributed Cache" on page 288 (step 8). In other words, the job JAR is downloaded from HDFS to the tasktracker, along with any files the job placed in the distributed cache.

 

Second, it creates a local working directory for the task, and un-jars the contents of the JAR into this directory.

Third, it creates an instance of TaskRunner to run the task.
TaskRunner launches a new Java Virtual Machine (step 9) to run each task in (step 10), so that any bugs in the user-defined map and reduce functions don’t affect the tasktracker (by causing it to crash or hang, for example). It is, however, possible to reuse the JVM between tasks; see “Task JVM Reuse” on page 216.
The child process communicates with its parent through the umbilical interface. This way it informs the parent of the task’s progress every few seconds until the task is complete.
Each task can perform setup and cleanup actions, which are run in the same JVM as the task itself, and are determined by the OutputCommitter for the job (see “Output Committers” on page 215). The cleanup action is used to commit the task, which in the case of file-based jobs means that its output is written to the final location for that task. The commit protocol ensures that when speculative execution is enabled (“Speculative Execution” on page 213), only one of the duplicate tasks is committed and the other is aborted.

A TaskRunner is created, which launches a new JVM (a child process) to run the task, so that a bug in the task itself cannot crash the tasktracker. The child keeps communicating with its parent to report how the task is going. Since repeatedly creating JVMs is expensive, this can be optimized by reusing JVMs across tasks, saving the cost of constant startup and teardown.
Tasks also have setup and cleanup actions; the cleanup action is responsible for the task commit, which guarantees that of the multiple attempts for the same task, only one is committed.
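
A sketch of putting a file into the distributed cache mentioned above, from the driver. This uses org.apache.hadoop.filecache.DistributedCache, the usual API in this Hadoop generation; the path and link name are hypothetical, and exception handling is omitted:

import java.net.URI;
import org.apache.hadoop.filecache.DistributedCache;

// The cached file is copied to every tasktracker that runs a task for this
// job and, with createSymlink(), linked into the task's working directory.
DistributedCache.addCacheFile(
        new URI("/shared/lookup.dat#lookup.dat"), job.getConfiguration());
DistributedCache.createSymlink(job.getConfiguration());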

 

Progress and Status Updates

MapReduce jobs are long-running batch jobs, taking anything from minutes to hours to run. Because this is a significant length of time, it’s important for the user to get feedback on how the job is progressing.

A job and each of its tasks have a status, which includes such things as the state of the job or task (e.g., running, successfully completed, failed), the progress of maps and reduces, the values of the job’s counters, and a status message or description (which may be set by user code).

These statuses change over the course of the job, so how do they get communicated back to the client?

Tasks also have a set of counters that count various events as the task runs (we saw an example in “A test run” on page 25), either those built into the framework, such as the number of map output records written, or ones defined by users.
If a task reports progress, it sets a flag to indicate that the status change should be sent to the tasktracker. The flag is checked in a separate thread every three seconds, and if set it notifies the tasktracker of the current task status.
Meanwhile, the tasktracker is sending heartbeats to the jobtracker every five seconds (this is a minimum, as the heartbeat interval is actually dependent on the size of the cluster: for larger clusters, the interval is longer), and the status of all the tasks being run by the tasktracker is sent in the call. Counters are sent less frequently than every five seconds, because they can be relatively high-bandwidth.
The jobtracker combines these updates to produce a global view of the status of all the jobs being run and their constituent tasks.

Finally, as mentioned earlier, the Job receives the latest status by polling the jobtracker every second. Clients can also use Job’s getStatus() method to obtain a JobStatus instance, which contains all of the status information for the job.
In short: the tasktracker periodically sends the status of its running tasks and their counters to the jobtracker via heartbeats, and the jobtracker aggregates them into the view shown to the user.
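
A sketch of doing the polling yourself on the client (assuming the job was started with job.submit() instead of waitForCompletion(), and that this runs in a method declared to throw Exception):

// Poll the jobtracker for progress instead of relying on waitForCompletion().
while (!job.isComplete()) {
    System.out.printf("map %.0f%% reduce %.0f%%%n",
            job.mapProgress() * 100, job.reduceProgress() * 100);
    Thread.sleep(5000);
}
System.out.println(job.isSuccessful() ? "job succeeded" : "job failed");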

 

Job Completion

When the jobtracker receives a notification that the last task for a job is complete (this will be the special job cleanup task), it changes the status for the job to “successful.”
Then, when the Job polls for status, it learns that the job has completed successfully, so it prints a message to tell the user and then returns from the waitForCompletion() method.
The jobtracker also sends an HTTP job notification if it is configured to do so. This can be configured by clients wishing to receive callbacks, via the job.end.notification.url property.
Last, the jobtracker cleans up its working state for the job and instructs tasktrackers to do the same (so intermediate output is deleted, for example).

Note that intermediate data such as map output is only deleted once the job has completed; this is for fault tolerance.
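
Returning to the HTTP notification above, a sketch of enabling it; the callback URL is hypothetical, and the $jobId/$jobStatus substitution variables are an assumption worth verifying against your release:

Configuration conf = job.getConfiguration();
// Ask the jobtracker to hit this URL when the job finishes.
conf.set("job.end.notification.url",
        "http://example.com/jobdone?id=$jobId&status=$jobStatus");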

 

YARN (MapReduce 2)

For very large clusters in the region of 4000 nodes and higher, the MapReduce system described in the previous section begins to hit scalability bottlenecks, so in 2010 a group at Yahoo! began to design the next generation of MapReduce. The result was YARN, short for Yet Another Resource Negotiator (or if you prefer recursive acronyms, YARN Application Resource Negotiator).

 

YARN meets the scalability shortcomings of “classic” MapReduce by splitting the responsibilities of the jobtracker into separate entities.
The jobtracker takes care of both job scheduling (matching tasks with tasktrackers) and task progress monitoring (keeping track of tasks and restarting failed or slow tasks, and doing task bookkeeping such as maintaining counter totals).

YARN separates these two roles into two independent daemons:
a resource manager to manage the use of resources across the cluster,
and an application master to manage the lifecycle of applications running on the cluster.

The idea is that an application master negotiates with the resource manager for cluster resources—described in terms of a number of containers each with a certain memory limit—then runs application-specific processes in those containers. The containers are overseen by node managers running on cluster nodes, which ensure that the application does not use more resources than it has been allocated.

In contrast to the jobtracker, each instance of an application—here a MapReduce job—has a dedicated application master, which runs for the duration of the application.
This model is actually closer to the original Google MapReduce paper, which describes how a master process is started to coordinate map and reduce tasks running on a set of workers.

As described, YARN is more general than MapReduce, and in fact MapReduce is just one type of YARN application.
There are a few other YARN applications—such as a distributed shell that can run a script on a set of nodes in the cluster—and others are actively being worked on (some are listed at http://wiki.apache.org/hadoop/PoweredByYarn). The beauty of YARN’s design is that different YARN applications can co-exist on the same cluster—so a MapReduce application can run at the same time as an MPI application, for example—which brings great benefits for manageability and cluster utilization.

More study needed later, but in short: YARN targets very large clusters (4,000+ nodes) by splitting the jobtracker's responsibilities apart; otherwise a single, overloaded node cannot cope with a cluster that size.

This design is also closer to Google's original paper, and it is more general: MapReduce becomes just one kind of YARN application.

 

 

Failures

In the real world, user code is buggy, processes crash, and machines fail. One of the major benefits of using Hadoop is its ability to handle such failures and allow your job to complete.

Failures in Classic MapReduce

In the MapReduce runtime there are three failure modes to consider: failure of the running task, failure of the tasktracker, and failure of the jobtracker.

 

Task Failure

Consider first the case of the child task failing. The most common way that this happens is when user code in the map or reduce task throws a runtime exception.
If this happens, the child JVM reports the error back to its parent tasktracker, before it exits. The error ultimately makes it into the user logs. The tasktracker marks the task attempt as failed, freeing up a slot to run another task.

Another failure mode is the sudden exit of the child JVM—perhaps there is a JVM bug that causes the JVM to exit for a particular set of circumstances exposed by the Map-Reduce user code.
In this case, the tasktracker notices that the process has exited and marks the attempt as failed.

Hanging tasks are dealt with differently. The tasktracker notices that it hasn’t received a progress update for a while and proceeds to mark the task as failed. The child JVM process will be automatically killed after this period. The timeout period after which tasks are considered failed is normally 10 minutes and can be configured on a per-job basis (or a cluster basis) by setting the mapred.task.timeout property to a value in milliseconds. Setting the timeout to a value of zero disables the timeout, so long-running tasks are never marked as failed. In this case, a hanging task will never free up its slot, and over time there may be cluster slowdown as a result. This approach should therefore be avoided, and making sure that a task is reporting progress periodically will suffice (see “What Constitutes Progress in MapReduce?” on page 193).

There are several ways a task can fail:
First, the task's own code throws an exception. This case is simple: the error is reported to the tasktracker and ends up in the user logs, and the tasktracker marks the task attempt as failed, freeing its slot to run another task.
Second, the child JVM itself crashes. The tasktracker notices that the process has exited and, again, marks the attempt as failed.
Third, hanging tasks. These are trickier: a task that stops responding is handled with a timeout, after which it is marked as failed and its JVM is killed. For genuinely long-running tasks the right approach is to report progress periodically (see the sketch below) rather than simply disabling the timeout.
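
For the third case, a sketch of a mapper that does slow work but keeps reporting progress so the timeout never fires (processChunk() is a hypothetical expensive step):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SlowButAliveMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Hypothetical expensive processing, broken into chunks so we can
        // report progress between chunks.
        for (int chunk = 0; chunk < 100; chunk++) {
            processChunk(value, chunk);
            context.progress();                          // tell the tasktracker we are alive
            context.setStatus("processed chunk " + chunk);
        }
        context.write(new Text("done"), value);
    }

    private void processChunk(Text value, int chunk) {
        // placeholder for the real (slow) work
    }
}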

When the jobtracker is notified of a task attempt that has failed (by the tasktracker’s heartbeat call), it will reschedule execution of the task.
The jobtracker will try to avoid rescheduling the task on a tasktracker where it has previously failed.
Furthermore, if a task fails four times (or more), it will not be retried further.
This value is configurable: the maximum number of attempts to run a task is controlled by the mapred.map.max.attempts property for map tasks and mapred.reduce.max.attempts for reduce tasks.
By default, if any task fails four times (or whatever the maximum number of attempts is configured to), the whole job fails.
For some applications, it is undesirable to abort the job if a few tasks fail, as it may be possible to use the results of the job despite some failures. In this case, the maximum percentage of tasks that are allowed to fail without triggering job failure can be set for the job.
Map tasks and reduce tasks are controlled independently, using the mapred.max.map.failures.percent and mapred.max.reduce.failures.percent properties.

What happens after a task fails?
The jobtracker reschedules the failed task, trying to avoid the tasktracker where it previously failed.
By default the jobtracker makes at most four attempts per task; if they all fail, the task is considered permanently failed, and normally any permanently failed task fails the whole job.
For applications that can tolerate some failed tasks, the acceptable failure percentage can be configured, and the job only fails if more than that fraction of tasks fail.
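
A sketch of tuning these limits per job, using the property names quoted above (the values are only illustrative):

Configuration conf = job.getConfiguration();
conf.setInt("mapred.map.max.attempts", 6);            // retry a failing map task up to 6 times
conf.setInt("mapred.reduce.max.attempts", 6);
// Tolerate some task failures without failing the whole job:
conf.setInt("mapred.max.map.failures.percent", 5);    // up to 5% of map tasks may fail
conf.setInt("mapred.max.reduce.failures.percent", 5);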

 

A task attempt may also be killed, which is different from it failing.
A task attempt may be killed because it is a speculative duplicate (for more, see “Speculative Execution” on page 213), or because the tasktracker it was running on failed, and the jobtracker marked all the task attempts running on it as killed. Killed task attempts do not count against the number of attempts to run the task (as set by mapred.map.max.attempts and mapred.reduce.max.attempts), since it wasn’t the task’s fault that an attempt was killed. Users may also kill or fail task attempts using the web UI or the command line (type hadoop job to see the options). Jobs may also be killed by the same mechanisms.

Besides failing on its own, a task attempt can also be killed from outside. This happens in several situations: with speculative execution, once one attempt succeeds the rest are killed; a tasktracker failure causes all attempts running on it to be marked as killed; and a user can kill attempts through the web UI or the command line.
Note that killed attempts do not count against the number of task attempts (default 4), because the attempt did not fail through any fault of the task itself.

 

Tasktracker Failure

Failure of a tasktracker is another failure mode.
If a tasktracker fails by crashing, or running very slowly, it will stop sending heartbeats to the jobtracker (or send them very infrequently). The jobtracker will notice a tasktracker that has stopped sending heartbeats (if it hasn’t received one for 10 minutes, configured via the mapred.task tracker.expiry.interval property, in milliseconds) and remove it from its pool of tasktrackers to schedule tasks on.
The jobtracker arranges for map tasks that were run and completed successfully on that tasktracker to be rerun if they belong to incomplete jobs, since their intermediate output residing on the failed tasktracker’s local filesystem may not be accessible to the reduce task. Any tasks in progress are also rescheduled.

The jobtracker detects a failed tasktracker simply through heartbeats: if none is received for 10 minutes (configurable), the tasktracker is considered to have failed.
Note that map tasks that completed successfully on the failed tasktracker also have to be rescheduled onto other tasktrackers (for jobs still in progress), because map output is intermediate data stored on the tasktracker's local filesystem and is no longer reachable. Any in-progress map or reduce tasks are of course rescheduled as well.

 

A tasktracker can also be blacklisted by the jobtracker, even if the tasktracker has not failed.
If more than four tasks from the same job fail on a particular tasktracker (set by (mapred.max.tracker.failures), then the jobtracker records this as a fault.
A tasktracker is blacklisted if the number of faults is over some minimum threshold (four, set by mapred.max.tracker.blacklists) and is significantly higher than the average number of faults for tasktrackers in the cluster.
Blacklisted tasktrackers are not assigned tasks, but they continue to communicate with the jobtracker. Faults expire over time (at the rate of one per day), so tasktrackers get the chance to run jobs again simply by leaving them running. Alternatively, if there is an underlying fault that can be fixed (by replacing hardware, for example), the tasktracker will be removed from the jobtracker’s blacklist after it restarts and rejoins the cluster.

A tasktracker may be sending heartbeats on time and not have failed itself, yet tasks on it keep failing, possibly because of a software or hardware problem on that node. This hurts job throughput and adds load on the jobtracker, so the jobtracker can blacklist such problem tasktrackers.
A blacklisted tasktracker is not assigned tasks but still communicates with the jobtracker; as its faults expire over time it gets the chance to run tasks again.
Likewise, if the tasktracker is restarted and rejoins the cluster, it is removed from the blacklist.
Quite a few cases are handled here; the design clearly comes from operational experience.

 

Jobtracker Failure

Failure of the jobtracker is the most serious failure mode.
Hadoop has no mechanism for dealing with failure of the jobtracker—it is a single point of failure—so in this case the job fails. However, this failure mode has a low chance of occurring, since the chance of a particular machine failing is low.

The good news is that the situation is improved in YARN, since one of its design goals is to eliminate single points of failure in Map-Reduce.
After restarting a jobtracker, any jobs that were running at the time it was stopped will need to be re-submitted. There is a configuration option that attempts to recover any running jobs (mapred.jobtracker.restart.recover, turned off by default), however it is known not to work reliably, so should not be used.

There is currently no strategy for coping with jobtracker failure, although the situation improves in YARN.

 

Job Scheduling

Early versions of Hadoop had a very simple approach to scheduling users’ jobs: they ran in order of submission, using a FIFO scheduler.

Later on, the ability to set a job’s priority was added, via the mapred.job.priority property or the setJobPriority() method on JobClient (both of which take one of the values VERY_HIGH, HIGH, NORMAL, LOW, VERY_LOW). When the job scheduler is choosing the next job to run, it selects one with the highest priority.

With this kind of priority, the scheduler still has to wait for the running job to finish before the higher-priority job is chosen, so it is not yet dynamic (preemptive) scheduling.
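
A sketch of raising a job's priority with the property named above (the old-API setter is shown as a comment):

// Either set the property directly on the job's configuration...
job.getConfiguration().set("mapred.job.priority", "HIGH");
// ...or, with the old API, use the setter:
// jobConf.setJobPriority(JobPriority.HIGH);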

MapReduce in Hadoop comes with a choice of schedulers. The default in MapReduce 1 is the original FIFO queue-based scheduler, and there are also multiuser schedulers called the Fair Scheduler and the Capacity Scheduler. MapReduce 2 comes with the Capacity Scheduler (the default), and the FIFO scheduler.

 

 

Shuffle and Sort

MapReduce makes the guarantee that the input to every reducer is sorted by key.
The process by which the system performs the sort—and transfers the map outputs to the reducers as inputs—is known as the shuffle.

In this section, we look at how the shuffle works, as a basic understanding would be helpful, should you need to optimize a Map-Reduce program.

Why is this called shuffle and sort?
My own understanding: on the map side the data has to be partitioned by the reducer that will receive it, and sorted, which is rather like a shuffle.
On the reduce side, the data arriving from the different map tasks is merge-sorted, hence sort.
Guaranteeing that every reducer's input is sorted by key sounds simple, but doing it for large volumes of data spread across many nodes takes a lot of work on both the map and the reduce side.

 

The Map Side

When the map function starts producing output, it is not simply written to disk.
The process is more involved, and takes advantage of buffering writes in memory and doing some presorting for efficiency reasons.

Each map task has a circular memory buffer that it writes the output to. The buffer is 100 MB by default, a size which can be tuned by changing the io.sort.mb property.
When the contents of the buffer reaches a certain threshold size (io.sort.spill.percent, default 0.80, or 80%), a background thread will start to spill the contents to disk.
Map outputs will continue to be written to the buffer while the spill takes place, but if the buffer fills up during this time, the map will block until the spill is complete.
Spills are written in round-robin fashion to the directories specified by the mapred.local.dir property, in a job-specific subdirectory.
So a map's output is first written to a circular memory buffer; when the buffer approaches its threshold, a background thread spills its contents to disk. The spill does not block the mapper from writing into the buffer unless the buffer fills up completely.
The mechanism is quite similar to an SSTable: buffer and sort in memory, spill a series of locally sorted files to disk, and finally combine them with a multi-way merge.

 

Before it writes to disk, the thread first divides the data into partitions corresponding to the reducers that they will ultimately be sent to.
Within each partition, the background thread performs an in-memory sort by key, and if there is a combiner function, it is run on the output of the sort. Running the combiner function makes for a more compact map output, so there is less data to write to local disk and to transfer to the reducer.
Before writing to disk, the thread first partitions the data by reducer, then sorts each partition by key in memory.
If a combiner is defined, it is run on the sorted output before the spill, reducing the amount of data that has to be written.
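
For example (a sketch assuming a word-count-style job whose reduce function is associative and commutative, using IntSumReducer from org.apache.hadoop.mapreduce.lib.reduce), the combiner is simply another Reducer registered on the job:

// Safe only because summing is associative and commutative, so running the
// combiner zero, one, or many times cannot change the final result.
job.setCombinerClass(IntSumReducer.class);   // often the same class as the reducer
job.setReducerClass(IntSumReducer.class);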

 

Each time the memory buffer reaches the spill threshold, a new spill file is created, so after the map task has written its last output record there could be several spill files.
Before the task is finished, the spill files are merged into a single partitioned and sorted output file. The configuration property io.sort.factor controls the maximum number of streams to merge at once; the default is 10.
If there are at least three spill files (set by the min.num.spills.for.combine property) then the combiner is run again before the output file is written. Recall that combiners may be run repeatedly over the input without affecting the final result. If there are only one or two spills, then the potential reduction in map output size is not worth the overhead in invoking the combiner, so it is not run again for this map output.
Each spill from the memory buffer to disk produces a new spill file.
When the map task finishes, all the spill files are merged into a single partitioned and sorted output file using a multi-way merge, 10-way by default.
Interestingly, if there are enough spill files (more than 3 by default), the combiner is run once more before the final output file is written, to further shrink the data that must be shipped to the reducer; since running the combiner has its own cost, there is a balance to strike.
This is why a combiner must be written so that running it zero, one, or many times never changes the final result: the framework makes no guarantee about whether, or how often, it will run.

It is often a good idea to compress the map output as it is written to disk, since doing so makes it faster to write to disk, saves disk space, and reduces the amount of data to transfer to the reducer. By default, the output is not compressed, but it is easy to enable by setting mapred.compress.map.output to true. The compression library to use is specified by mapred.map.output.compression.codec; see “Compression” on page 85 for more on compression formats.
Finally, the map output is best kept compressed; it stays on the local filesystem, and reducers fetch it over HTTP.
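
A sketch of the map-side knobs mentioned above (values are illustrative; the Snappy codec assumes the native library is installed on the cluster):

Configuration conf = job.getConfiguration();
conf.setInt("io.sort.mb", 200);                          // bigger in-memory sort buffer
conf.set("io.sort.spill.percent", "0.80");               // spill when the buffer is 80% full
conf.setInt("io.sort.factor", 10);                       // streams merged at once
conf.setBoolean("mapred.compress.map.output", true);     // compress spilled map output
conf.setClass("mapred.map.output.compression.codec",
        org.apache.hadoop.io.compress.SnappyCodec.class,
        org.apache.hadoop.io.compress.CompressionCodec.class);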
 

The Reduce Side

Let’s turn now to the reduce part of the process.
The map output file is sitting on the local disk of the machine that ran the map task (note that although map outputs always get written to local disk, reduce outputs may not be), but now it is needed by the machine that is about to run the reduce task for the partition. Furthermore, the reduce task needs the map output for its particular partition from several map tasks across the cluster. The map tasks may finish at different times, so the reduce task starts copying their outputs as soon as each completes. This is known as the copy phase of the reduce task. The reduce task has a small number of copier threads so that it can fetch map outputs in parallel. The default is five threads, but this number can be changed by setting the mapred.reduce.parallel.copies property.
So the reduce side begins with the copy phase, pulling its own partition's data from the various mapper nodes. When does copying start? Because every mapper finishes at a different time, copying from a given mapper should begin as soon as that mapper completes (it cannot start earlier, since a map's output must be complete and sorted before it can be fetched).
Which raises a question: how does the reducer know which mapper nodes to copy data from?

 

How do reducers know which machines to fetch map output from?
As map tasks complete successfully, they notify their parent tasktracker of the status update, which in turn notifies the jobtracker.
These notifications are transmitted over the heartbeat communication mechanism described earlier. Therefore, for a given job, the jobtracker (or application master) knows the mapping between map outputs and hosts.
A thread in the reducer periodically asks the master for map output hosts until it has retrieved them all.
Hosts do not delete map outputs from disk as soon as the first reducer has retrieved them, as the reducer may subsequently fail. Instead, they wait until they are told to delete them by the jobtracker (or application master), which is after the job has completed.
Map task completion is reported via heartbeats from the tasktracker to the jobtracker, and a thread in each reducer periodically asks the jobtracker which hosts hold map output for it.
Map outputs are only deleted after the whole job has completed successfully, because a reducer may fail and need to fetch them again; this is required for fault tolerance.

 

The map outputs are copied to the reduce task JVM’s memory if they are small enough (the buffer’s size is controlled by mapred.job.shuffle.input.buffer.percent, which specifies the proportion of the heap to use for this purpose); otherwise, they are copied to disk. When the in-memory buffer reaches a threshold size (controlled by mapred.job.shuffle.merge.percent), or reaches a threshold number of map outputs (mapred.inmem.merge.threshold), it is merged and spilled to disk. If a combiner is specified it will be run during the merge to reduce the amount of data written to disk.
As the copies accumulate on disk, a background thread merges them into larger, sorted files. This saves some time merging later on. Note that any map outputs that were compressed (by the map task) have to be decompressed in memory in order to perform a merge on them.
Next comes the pre-processing of the copied data: map outputs that are small enough are kept in memory, larger ones are spilled to disk, and as files accumulate on disk a background thread merges them into larger, sorted files to save work in the later merge. If a combiner is defined, it is run again here to keep the stored data compact.
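
A sketch of the reduce-side counterparts, using the properties quoted above (values again illustrative):

Configuration conf = job.getConfiguration();
conf.setInt("mapred.reduce.parallel.copies", 10);              // copier threads per reduce task
conf.set("mapred.job.shuffle.input.buffer.percent", "0.70");   // share of reduce heap for copies
conf.set("mapred.job.shuffle.merge.percent", "0.66");          // merge when the buffer is this full
conf.setInt("mapred.inmem.merge.threshold", 1000);             // ...or after this many map outputs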

When all the map outputs have been copied, the reduce task moves into the sort phase (which should properly be called the merge phase, as the sorting was carried out on the map side), which merges the map outputs, maintaining their sort ordering. This is done in rounds. For example, if there were 50 map outputs, and the merge factor was 10 (the default, controlled by the io.sort.factor property, just like in the map’s merge), then there would be 5 rounds. Each round would merge 10 files into one, so at the end there would be five intermediate files.
Rather than have a final round that merges these five files into a single sorted file, the merge saves a trip to disk by directly feeding the reduce function in what is the last phase: the reduce phase. This final merge can come from a mixture of in-memory and on-disk segments.
Then comes the merge (sort) phase, with a merge factor of at most 10 by default, so 50 map outputs would produce five intermediate files in the first round.
Note the optimization here: the last merge round is not written to disk but fed directly into the reduce function; in effect the final round is deferred until the reduce is about to start and then streamed straight to it.
Given that optimization, the aim is to minimize disk I/O, so the naive plan above is not ideal; instead the rounds are arranged so that the final round merges the full merge factor (10) of segments, which reduces the earlier file merging.
The goal is to merge the minimum number of files to get to the merge factor for the final round.
The book's example: with 40 map outputs and a merge factor of 10, the first round merges only 4 files, the next three rounds merge 10 files each, and the resulting 4 intermediate files plus the 6 still-unmerged files make exactly 10 segments for the final round, so those last 6 files never pass through an extra merge.
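
A small sketch of that arithmetic (an illustration written for this note, mirroring rather than copying Hadoop's merger logic): only the first pass is shrunk, so every later pass, including the final one that feeds the reduce function, handles exactly the merge factor:

// Illustrative only: how many files the first merge pass should take so that
// the final pass ends up with exactly `factor` segments.
public class MergePlan {
    static int firstPassSize(int numFiles, int factor) {
        if (numFiles <= factor) {
            return numFiles;                  // a single pass, fed straight to reduce
        }
        int mod = (numFiles - 1) % (factor - 1);
        return mod == 0 ? factor : mod + 1;
    }

    public static void main(String[] args) {
        // 40 spill files, merge factor 10: the first pass merges 4 files,
        // three passes merge 10 each, and 6 unmerged + 4 intermediate = 10
        // segments remain for the final pass that feeds the reducer.
        System.out.println(firstPassSize(40, 10));   // prints 4
        System.out.println(firstPassSize(50, 10));   // prints 5
    }
}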

 

During the reduce phase, the reduce function is invoked for each key in the sorted output. The output of this phase is written directly to the output filesystem, typically HDFS. In the case of HDFS, since the tasktracker node (or node manager) is also running a datanode, the first block replica will be written to the local disk.
Finally the reduce function runs and its output is written to HDFS.

 

 

Configuration Tuning

We are now in a better position to understand how to tune the shuffle to improve MapReduce performance.

The general principle is to give the shuffle as much memory as possible. However, there is a trade-off, in that you need to make sure that your map and reduce functions get enough memory to operate.

On the map side, the best performance can be obtained by avoiding multiple spills to disk; one is optimal.

On the reduce side, the best performance is obtained when the intermediate data can reside entirely in memory.

See the table in the book for the specific tuning parameters.

 

Task Execution

We saw how the MapReduce system executes tasks in the context of the overall job at the beginning of the chapter in “Anatomy of a MapReduce Job Run” on page 187.
In this section, we’ll look at some more controls that MapReduce users have over task execution.

The Task Execution Environment

Hadoop provides information to a map or reduce task about the environment in which it is running; the following properties can be read from the task's configuration (see the sketch after the list).

mapred.job.id , String , job_200811201130_0004
The job ID. (See “Job, Task, and Task Attempt IDs” on page 163 for a description of the format.)

mapred.tip.id , String , task_200811201130_0004_m_000003
The task ID.

mapred.task.id , String , attempt_200811201130_0004_m_000003_0
The task attempt ID. (Not the task ID.)

mapred.task.partition , int
The index of the task within the job.

mapred.task.is.map, boolean
Whether this task is a map task.
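
A sketch of reading these properties from inside a task (new API; the values are populated by the framework in the task's configuration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class EnvAwareMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void setup(Context context) {
        Configuration conf = context.getConfiguration();
        String jobId = conf.get("mapred.job.id");
        String taskAttemptId = conf.get("mapred.task.id");
        boolean isMap = conf.getBoolean("mapred.task.is.map", true);
        int partition = conf.getInt("mapred.task.partition", -1);
        System.out.printf("job=%s attempt=%s map=%b partition=%d%n",
                jobId, taskAttemptId, isMap, partition);
    }
}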

Speculative Execution

The MapReduce model is to break jobs into tasks and run the tasks in parallel to make the overall job execution time smaller than it would otherwise be if the tasks ran sequentially.
This makes job execution time sensitive to slow-running tasks, as it takes only one slow task to make the whole job take significantly longer than it would have done otherwise. When a job consists of hundreds or thousands of tasks, the possibility of a few straggling tasks is very real.
Tasks may be slow for various reasons, including hardware degradation or software mis-configuration, but the causes may be hard to detect since the tasks still complete successfully, albeit after a longer time than expected. Hadoop doesn’t try to diagnose and fix slow-running tasks; instead, it tries to detect when a task is running slower than expected and launches another, equivalent, task as a backup. This is termed speculative execution of tasks.

It’s important to understand that speculative execution does not work by launching two duplicate tasks at about the same time so they can race each other. This would be wasteful of cluster resources. Rather, a speculative task is launched only after all the tasks for a job have been launched, and then only for tasks that have been running for some time (at least a minute) and have failed to make as much progress, on average, as the other tasks from the job.

When a task completes successfully, any duplicate tasks that are running are killed since they are no longer needed. So if the original task completes before the speculative task, then the speculative task is killed; on the other hand, if the speculative task finishes first, then the original is killed.

Speculative execution is an optimization, not a feature to make jobs run more reliably.
If there are bugs that sometimes cause a task to hang or slow down, then relying on speculative execution to avoid these problems is unwise, and won’t work reliably, since the same bugs are likely to affect the speculative task. You should fix the bug so that the task doesn’t hang or slow down.

 

Why would you ever want to turn off speculative execution?
The goal of speculative execution is to reduce job execution time, but this comes at the cost of cluster efficiency.
On a busy cluster, speculative execution can reduce overall throughput, since redundant tasks are being executed in an attempt to bring down the execution time for a single job. For this reason, some cluster administrators prefer to turn it off on the cluster and have users explicitly turn it on for individual jobs. This was especially relevant for older versions of Hadoop, when speculative execution could be overly aggressive in scheduling speculative tasks.

There is a good case for turning off speculative execution for reduce tasks, since any duplicate reduce tasks have to fetch the same map outputs as the original task, and this can significantly increase network traffic on the cluster.
Another reason that speculative execution is turned off is for tasks that are not idempotent.
However in many cases it is possible to write tasks to be idempotent and use an OutputCommitter to promote the output to its final location when the task succeeds.

Speculative execution is easy enough to understand, and the feature clearly comes from practice: one very slow map task drags out the whole job, so something has to be done about it.
The slowness is usually not the code's fault but the node's (its hardware or software environment), so Hadoop's approach, once it detects such a slow task, is to launch an identical task on another node to run in parallel; whichever attempt finishes first wins, and the remaining attempts are killed.
"Slow" here is relative to the other map tasks of the same job.

In some situations, though, enabling speculative execution hurts efficiency, for example on a busy cluster, for reduce tasks, or for non-idempotent operations (where doing the work once and doing it several times give different results).
So whether to use it depends on the actual situation.
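
A sketch of switching speculative execution off per job; the property names are the usual Hadoop 1.x ones but are not quoted in the text above, so treat them as an assumption to verify:

Configuration conf = job.getConfiguration();
// Often left on for maps but switched off for reduces, since duplicate
// reduces re-fetch all of their map outputs over the network.
conf.setBoolean("mapred.map.tasks.speculative.execution", true);
conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);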

 

Output Committers

Hadoop MapReduce uses a commit protocol to ensure that jobs and tasks either succeed, or fail cleanly.
The behavior is implemented by the OutputCommitter in use for the job, and this is set in the old MapReduce API by calling the setOutputCommitter() on JobConf, or by setting mapred.output.committer.class in the configuration.
In the new MapReduce API, the OutputCommitter is determined by the OutputFormat, via its getOutputCommitter() method. The default is FileOutputCommitter, which is appropriate for file-based MapReduce. You can customize an existing OutputCommitter or even write a new implementation if you need to do special setup or cleanup for jobs or tasks.

public abstract class OutputCommitter {
    public abstract void setupJob(JobContext jobContext) throws IOException;
    public void commitJob(JobContext jobContext) throws IOException { }
    public void abortJob(JobContext jobContext, JobStatus.State state)
        throws IOException { }


    public abstract void setupTask(TaskAttemptContext taskContext)
        throws IOException;
    public abstract boolean needsTaskCommit(TaskAttemptContext taskContext)
        throws IOException;
    public abstract void commitTask(TaskAttemptContext taskContext)
        throws IOException;
    public abstract void abortTask(TaskAttemptContext taskContext)
        throws IOException;
}

As the class definition shows, the OutputCommitter is responsible for job- and task-level setup and cleanup work. Task commit is the trickier part, because a task may have many attempts and only one of their outputs must ultimately be committed.

The commit phase for tasks is optional, and may be disabled by returning false from needsTaskCommit(). This saves the framework from having to run the distributed commit protocol for the task, and neither commitTask() nor abortTask() is called.

If a task succeeds then commitTask() is called, which in the default implementation moves the temporary task output directory (which has the task attempt ID in its name to avoid conflicts between task attempts) to the final output path, ${mapred.output.dir}. Otherwise, the framework calls abortTask(), which deletes the temporary task output directory.

The framework ensures that in the event of multiple task attempts for a particular task, only one will be committed, and the others will be aborted. This situation may arise because the first attempt failed for some reason—in which case it would be aborted, and a later, successful attempt would be committed. Another case is if two task attempts were running concurrently as speculative duplicates, then the one that finished first would be committed, and the other would be aborted.
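
As a sketch of the API listed above, a committer that opts out of task commit entirely (similar in spirit to the committer used by NullOutputFormat) might look like this; a real file-based committer would instead move the task's temporary output in commitTask():

import java.io.IOException;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.OutputCommitter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public class NoOpOutputCommitter extends OutputCommitter {
    @Override public void setupJob(JobContext jobContext) throws IOException { }
    @Override public void setupTask(TaskAttemptContext taskContext) throws IOException { }
    @Override public boolean needsTaskCommit(TaskAttemptContext taskContext) throws IOException {
        return false;   // skip the distributed commit protocol entirely
    }
    @Override public void commitTask(TaskAttemptContext taskContext) throws IOException { }
    @Override public void abortTask(TaskAttemptContext taskContext) throws IOException { }
}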

 

Task JVM Reuse

Hadoop runs tasks in their own Java Virtual Machine to isolate them from other running tasks. The overhead of starting a new JVM for each task can take around a second, which for jobs that run for a minute or so is insignificant. However, jobs that have a large number of very short-lived tasks (these are usually map tasks), or that have lengthy initialization, can see performance gains when the JVM is reused for subsequent tasks.
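
A sketch of enabling reuse; mapred.job.reuse.jvm.num.tasks is the standard Hadoop 1.x property (default 1, meaning no reuse), but treat the name as an assumption to check against your release:

Configuration conf = job.getConfiguration();
// -1 means an unlimited number of tasks from the same job may share one JVM.
conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);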

 

Skipping Bad Records

The best way to handle corrupt records is in your mapper or reducer code. You can detect the bad record and ignore it, or you can abort the job by throwing an exception.

You can also count the total number of bad records in the job using counters to see how widespread the problem is.

In rare cases, though, you can't handle the problem because there is a bug in a third-party library that you can't work around in your mapper or reducer. In these cases, you can use Hadoop's optional skipping mode for automatically skipping bad records.
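
A sketch of the first approach, skipping and counting corrupt records in your own mapper code (the parse step is hypothetical):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TolerantMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    enum BadRecords { CORRUPT }                        // shows up in the job's counters

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        try {
            String field = parse(value.toString());    // hypothetical parse that may throw
            context.write(new Text(field), new LongWritable(1));
        } catch (IllegalArgumentException e) {
            context.getCounter(BadRecords.CORRUPT).increment(1);   // skip and count it
        }
    }

    private String parse(String line) {
        if (line.isEmpty()) {
            throw new IllegalArgumentException("empty record");
        }
        return line.split("\t")[0];
    }
}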
