
About Gridmix3 (Gridmix3 – Emulating Production Workload for Apache Hadoop)

September 17, 2014 ⁄ General

Reposted from:

Preface: http://blog.csdn.net/zhaomirong/article/details/7818922

Tracking, modeling, and mimicking this adaptive, complex workload is a prerequisite to effective performance engineering.

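Why Gridmix matters: tracking, modeling, and mimicking a complex, adaptive workload is a prerequisite for meaningful performance evaluation.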


Features of Gridmix 1 and 2

Gridmix consists of the following parts:

1、A data generation script that must be run once to generate data needed for running
the actual benchmark.

A script that generates the data needed to run the benchmark.

2、Several types of jobs in the "mix". The particular jobs varied slightly between Gridmix and Gridmix2, but examples include sorts
of text data and SequenceFiles, jobs sampling from large, compressed datasets, chains of MapReduce jobs, and jobs exercising the combiner

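Several job types in the mix. The specific jobs differ slightly between Gridmix and Gridmix2, but both include sorts of text data and SequenceFiles, jobs that sample from large compressed datasets, chains of MapReduce jobs, and jobs that exercise the combiner.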

3、Configuration files- either XML files or shell scripts- that permitted the operator to tune the number of jobs in each type and
of each size

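Configuration files, in the form of XML files or shell scripts, that provide an interface for tuning the number of jobs of each type and size.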

4、A driver that parses the configuration file, submits MapReduce jobs to the cluster, and (in Gridmix2) collects statistics for each
job.

A driver that parses the configuration file, submits MapReduce jobs to the cluster, and (in Gridmix2) collects statistics for each job.


While running Gridmix as an end-to-end benchmark yielded many insights into framework bottlenecks in medium-sized clusters under a saturating load, it does not model the diverse mix of jobs running in our environment. On the contrary, framework improvements motivated by studies based on Gridmix often had ambiguous effects in production after showing dramatic gains in a test environment.

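Running Gridmix as a benchmark did reveal many framework bottlenecks on medium-sized clusters under saturating load, but it does not model the diverse mix of jobs found in a real environment. As a result, framework improvements guided by Gridmix often showed large gains in the test environment but had ambiguous effects in production.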

So it is critical to not only model the full range of jobs on the cluster, but also the co-incidence of
jobs on the grid to accurately identify and reproduce load-related bottlenecks.

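So it is important to model both the full range of job types and the way jobs coincide on the cluster in order to accurately identify and reproduce load-related bottlenecks.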


Gridmix3

The above observations motivated us to develop a new workload generator, Gridmix3, to narrow
the following gaps between the existing benchmark suite and the load we're interested in:

Compared with the existing benchmark suite, Gridmix3 aims to improve how the load is constructed in the following areas:

1、Task distribution: The number of tasks in a job varies widely in Yahoo! clusters, from a
single task to hundreds of thousands.

Task distribution: the number of tasks per job ranges from one to hundreds of thousands.

2、Submission interval: Users are not nearly as predictable as the saturating benchmarks. We observe
cyclic and bursty patterns that stress the JobTracker and the cluster in ways not seen in the synthetic load.

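Job submission frequency: user behavior is hard to predict. We observe cyclic and bursty pressure on the JobTracker that does not appear in the synthetic load.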

3、Input dataset: The data processed by Gridmix jobs are distributed over a small subset of the blocks
in a system, with artificial hotspots and a distribution unlike what is measured in practice.

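The data processed by Gridmix jobs covers only a small subset of the blocks in the system, so the resulting artificial hotspots and data distribution differ considerably from what is seen in practice.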

4、User diversity: The JobTracker schedulers are far more interesting under multiple user workloads.

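User diversity: multi-user workloads have a large effect on JobTracker scheduling (presumably because multiple users map to multiple queues or priorities).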

5、Job complexity: User jobs are not the simple, I/O bound sorts of saturating benchmarks, pressing
the low-level limits of the framework.

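Job complexity: real user jobs are not the simple, I/O-bound sorts that saturating benchmarks use to press the low-level limits of the framework.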


Instead, we elected to build a benchmark that can reproduce the workload by replaying traces that
automatically capture essential ingredients of job executions.

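Before this point the original post notes that cluster load changes constantly, so choosing the simulated job types from a static snapshot of the cluster at one moment is unsound. The approach chosen instead is:

build the benchmark by replaying traces that capture the key ingredients of job execution.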


The benchmark takes as input a job trace, essentially a stream of JSON-encoded job descriptions derived
from artifacts collected on a cluster (such as job history logs and configuration files).

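The benchmark takes a job trace as input. A job trace is essentially a stream of JSON-encoded job descriptions derived from data collected on a cluster (such as job history logs and configuration files).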
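To make "a stream of JSON-encoded job descriptions" concrete, here is a minimal sketch of reading such a stream in Python. The field names (`job_id`, `submit_time_ms`, `task_memory_mb`, `tasks`, `bytes_in`, and so on) are hypothetical and chosen only for illustration; they are not the actual schema that Rumen emits.

```python
import json


def iter_job_descriptions(text):
    """Yield each JSON object from a string of concatenated JSON documents."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so do it by hand.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


# Hypothetical two-job trace; the field names are assumptions, not Rumen's schema.
SAMPLE_TRACE = """
{"job_id": "job_1", "submit_time_ms": 1000, "task_memory_mb": 1024,
 "tasks": [{"bytes_in": 500, "records_in": 5, "bytes_out": 100, "records_out": 1}]}
{"job_id": "job_2", "submit_time_ms": 4000, "task_memory_mb": 2048,
 "tasks": [{"bytes_in": 900, "records_in": 9, "bytes_out": 300, "records_out": 3},
           {"bytes_in": 100, "records_in": 1, "bytes_out": 50, "records_out": 1}]}
"""

jobs = list(iter_job_descriptions(SAMPLE_TRACE))
for job in jobs:
    total_bytes_in = sum(task["bytes_in"] for task in job["tasks"])
    print(job["job_id"], job["submit_time_ms"], len(job["tasks"]), total_bytes_in)
```

Decoding one object at a time keeps memory use proportional to a single job description once the input is read incrementally, which matters when a trace covers weeks of cluster activity.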

For each job, the submitting client obtains the original job submission time, the memory allocated
to each task, and- for each task in that job- the byte and record counts read and written.

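For each job, the Gridmix client obtains the original submission time, the memory allocated to each task, and, from the counters, the byte and record counts that each task read and wrote.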

Given this data, it constructs a synthetic job with the same byte and record patterns recorded in
the trace and submits each at a matching interval, "replaying" that job on the cluster. The synthetic job will generate a comparable load on the I/O subsystems in the test grid as the original did in production.

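From this data, the client constructs a synthetic job with the same byte and record patterns and submits it at the same interval as in the trace, replaying the job on the cluster. The synthetic job produces an I/O load on the test cluster comparable to the load the original job produced in production.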
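The replay step described above can be sketched roughly as follows. This is not Gridmix3's implementation: `SyntheticJob`, `build_replay_plan`, `replay`, and the trace field names are all hypothetical, and the `submit` callback stands in for whatever actually submits a job to the cluster. The sketch also ignores the time spent inside `submit` itself.

```python
import time
from dataclasses import dataclass


@dataclass
class SyntheticJob:
    job_id: str
    submit_offset_s: float  # delay from the start of the replay
    task_memory_mb: int
    task_io: list  # per task: (bytes_in, records_in, bytes_out, records_out)


def build_replay_plan(trace_jobs):
    """Turn trace entries into synthetic jobs that keep the original spacing."""
    ordered = sorted(trace_jobs, key=lambda j: j["submit_time_ms"])
    if not ordered:
        return []
    first_submit = ordered[0]["submit_time_ms"]
    return [
        SyntheticJob(
            job_id=j["job_id"],
            submit_offset_s=(j["submit_time_ms"] - first_submit) / 1000.0,
            task_memory_mb=j["task_memory_mb"],
            task_io=[(t["bytes_in"], t["records_in"], t["bytes_out"], t["records_out"])
                     for t in j["tasks"]],
        )
        for j in ordered
    ]


def replay(plan, submit, sleep=time.sleep):
    """Submit each synthetic job after waiting out the original gap."""
    elapsed = 0.0
    for job in plan:
        sleep(job.submit_offset_s - elapsed)
        elapsed = job.submit_offset_s
        submit(job)


# Hypothetical trace entries (field names are assumptions).
trace = [
    {"job_id": "job_b", "submit_time_ms": 4000, "task_memory_mb": 2048,
     "tasks": [{"bytes_in": 900, "records_in": 9, "bytes_out": 300, "records_out": 3}]},
    {"job_id": "job_a", "submit_time_ms": 1000, "task_memory_mb": 1024,
     "tasks": [{"bytes_in": 500, "records_in": 5, "bytes_out": 100, "records_out": 1}]},
    {"job_id": "job_c", "submit_time_ms": 4500, "task_memory_mb": 1024,
     "tasks": [{"bytes_in": 200, "records_in": 2, "bytes_out": 200, "records_out": 2}]},
]

plan = build_replay_plan(trace)
waits, submitted = [], []
# Record the waits instead of sleeping so the example finishes immediately.
replay(plan, submit=lambda job: submitted.append(job.job_id), sleep=waits.append)
print(submitted, waits)
```

Passing `sleep` as a parameter is a design choice for the sketch: it lets the schedule be checked without waiting in real time, and the default `time.sleep` gives the real-time behaviour.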

As our models for synthetic jobs improve, Gridmix3 runs will capture interactions between jobs- for
example, two I/O-bound tasks running on the same node- with accuracy sufficient for analysis and validation of code changes. We are in the process of extending the synthetic job mix to model CPU usage, memory, job dependencies, etc. as data become available.
Since users often rely on core libraries and frameworks built on top of MapReduce, we also plan to include facilities for adding representatives of these to the job mix.

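As the models for synthetic jobs improve, Gridmix3 runs will capture the interactions between jobs accurately enough to analyze problems and validate code changes. The synthetic job mix is also being extended to model CPU usage, memory, job dependencies, and so on, and there are plans to add a way to include representatives of the libraries and frameworks built on top of MapReduce in the job mix.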


JobTrace:

The job trace is derived from job execution history logs collected by the JobTracker, through a tool called Rumen. Rumen not only parses job history logs, it also provides several means to adjust the density of the trace to match the size of the testing cluster, such as sampling or trace overlay (a method for interleaving segments of job traces, so one may, for example, take two weeks from one cluster and combine them, capturing cyclic effects in the augmented load). Future work in scaling cluster workloads to smaller/larger clusters and extracting job properties from cluster artifacts will be integrated into this tool.

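The job trace is extracted from the JobTracker's job execution history logs using Rumen. Rumen not only parses the job history logs, it also offers several ways to adjust the density of the trace to the size of the test cluster, such as sampling or trace overlay (for example, taking two weeks of logs from one cluster and combining them to capture cyclic effects). Future work on scaling workloads to smaller or larger clusters and on extracting job properties will also be integrated into this tool.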
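The two density adjustments mentioned above, sampling and trace overlay, can be illustrated with a small sketch. It shows the general idea only and is not how Rumen implements either one; the function names and the `submit_time_ms` field are hypothetical.

```python
import random


def sample_trace(jobs, fraction, seed=0):
    """Thin a trace by keeping each job with the given probability."""
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    return [job for job in jobs if rng.random() < fraction]


def overlay(segment_a, segment_b):
    """Densify a trace by laying two segments over the same time window.

    Each segment is rebased so its first job starts at time 0, then the
    two are merged in submission order.
    """
    def rebase(segment):
        if not segment:
            return []
        start = min(job["submit_time_ms"] for job in segment)
        return [dict(job, submit_time_ms=job["submit_time_ms"] - start)
                for job in segment]

    merged = rebase(segment_a) + rebase(segment_b)
    return sorted(merged, key=lambda job: job["submit_time_ms"])


# Hypothetical segments, e.g. two different weeks from the same cluster.
week_1 = [{"job_id": "a", "submit_time_ms": 1000},
          {"job_id": "b", "submit_time_ms": 5000}]
week_2 = [{"job_id": "c", "submit_time_ms": 700000},
          {"job_id": "d", "submit_time_ms": 702000}]

combined = overlay(week_1, week_2)
print([(job["job_id"], job["submit_time_ms"]) for job in combined])
```

Sampling lowers the load to fit a smaller test cluster, while overlay raises it by stacking segments on the same time window, so the two can be combined to reach a target density.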


Where We are Today?

Gridmix3 is now an integral part of our Hadoop development process (Figure 3). It is
running nightly on our internal automatic regression framework to catch performance regressions; it has also reproduced a load-related failure observed in production and validated its fix.

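Gridmix3 is now an integral part of the Hadoop development process. It runs nightly on the internal automated regression framework to catch performance regressions, and it has also reproduced a load-related failure observed in production and validated the fix.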



Original article:

http://developer.yahoo.com/blogs/hadoop/posts/2010/04/gridmix3_emulating_production/#comments





