How to Benchmark a Hadoop Cluster
http://answers.oreilly.com/topic/460-how-to-benchmark-a-hadoop-cluster/
Is the cluster set up correctly? The best way to answer this question is empirically: run some jobs and confirm that you get the expected results. Benchmarks make good tests, as you also get numbers that you can compare with other clusters as a sanity check on whether your new cluster is performing roughly as expected. And you can tune a cluster using benchmark results to squeeze the best performance out of it. This is often done with monitoring systems in place, so you can see how resources are being used across the cluster.
To get the best results, you should run benchmarks on a cluster that is not being used by others. In practice, this means just before it is put into service and users start relying on it. Once users have periodically scheduled jobs on a cluster, it is generally impossible to find a time when the cluster is not being used (unless you arrange downtime with users), so you should run benchmarks to your satisfaction before this happens.
Experience has shown that most hardware failures for new systems are hard drive failures. By running I/O intensive benchmarks—such as the ones described next—you can “burn in” the cluster before it goes live.
Hadoop Benchmarks
Hadoop comes with several benchmarks that you can run very easily with minimal setup cost. Benchmarks are packaged in the test JAR file, and you can get a list of them, with descriptions, by invoking the JAR file with no arguments:
% hadoop jar $HADOOP_INSTALL/hadoop-*-test.jar
Most of the benchmarks show usage instructions when invoked with no arguments. For example:
% hadoop jar $HADOOP_INSTALL/hadoop-*-test.jar TestDFSIO
TestFDSIO.0.0.4 Usage: TestFDSIO -read | -write | -clean [-nrFiles N] [-fileSize MB] [-resFile resultFileName] [-bufferSize Bytes]
Benchmarking HDFS with TestDFSIO
TestDFSIO tests the I/O performance of HDFS. It does this by using a MapReduce job as a convenient way to read or write files in parallel. Each file is read or written in a separate map task, and the output of the map is used for collecting statistics relating to the file just processed. The statistics are accumulated in the reduce to produce a summary.
The following command writes 10 files of 1,000 MB each:
% hadoop jar $HADOOP_INSTALL/hadoop-*-test.jar TestDFSIO -write -nrFiles 10 -fileSize 1000
At the end of the run, the results are written to the console and also recorded in a local file (which is appended to, so you can rerun the benchmark and not lose old results):
% cat TestDFSIO_results.log
----- TestDFSIO ----- : write
           Date & time: Sun Apr 12 07:14:09 EDT 2009
       Number of files: 10
Total MBytes processed: 10000
     Throughput mb/sec: 7.796340865378244
Average IO rate mb/sec: 7.8862199783325195
 IO rate std deviation: 0.9101254683525547
    Test exec time sec: 163.387
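It can help to work through what these figures mean. The sketch below assumes, from our reading of the TestDFSIO source rather than anything stated in the text, that "Throughput mb/sec" is the total data volume divided by the summed execution time of all map tasks, so it describes a typical single task rather than the cluster as a whole. The function name dfsio_breakdown is our own.

```shell
# Break down a TestDFSIO result.
# $1 = Total MBytes processed, $2 = Throughput mb/sec,
# $3 = Test exec time sec,     $4 = Number of files (one map task each).
# Assumption: reported throughput = total MB / summed map task time.
dfsio_breakdown() {
  awk -v mb="$1" -v tp="$2" -v secs="$3" -v n="$4" 'BEGIN {
    summed = mb / tp
    printf "summed task time: %.1f s\n", summed
    printf "mean time per task: %.1f s\n", summed / n
    printf "aggregate rate over wall clock: %.1f MB/s\n", mb / secs
  }'
}

# Figures copied from the write run above.
dfsio_breakdown 10000 7.796340865378244 163.387 10
```

Dividing the total by the wall-clock "Test exec time" gives a rough cluster-wide rate, which also includes job startup overhead, so treat it as a lower bound on what the disks and network sustained.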
The files are written under the /benchmarks/TestDFSIO directory by default (this can be changed by setting the test.build.data system property), in a directory called io_data.
To run a read benchmark, use the -read argument. Note that these files must already exist (having been written by TestDFSIO -write):
% hadoop jar $HADOOP_INSTALL/hadoop-*-test.jar TestDFSIO -read -nrFiles 10 -fileSize 1000
Here are the results for a real run:
----- TestDFSIO ----- : read
           Date & time: Sun Apr 12 07:24:28 EDT 2009
       Number of files: 10
Total MBytes processed: 10000
     Throughput mb/sec: 80.25553361904304
Average IO rate mb/sec: 98.6801528930664
 IO rate std deviation: 36.63507598174921
    Test exec time sec: 47.624
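Because the results file is appended to, it soon holds several runs, and a small filter makes them easier to compare. This is a minimal sketch that assumes the log file keeps one "label: value" field per line; dfsio_throughputs is a name of our own choosing, not part of Hadoop.

```shell
# Print "<test type> <throughput>" for every run recorded in a
# TestDFSIO results log. $1 = path to the log file.
dfsio_throughputs() {
  awk -F': *' '
    /----- TestDFSIO -----/ { type = $NF }        # "write" or "read"
    /Throughput mb\/sec/    { print type, $NF }   # value after the colon
  ' "$1"
}
```

Run it as dfsio_throughputs TestDFSIO_results.log. For the write and read runs shown in this section it prints one line per run, pairing each test type with its reported throughput.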
When you’ve finished benchmarking, you can delete all the generated files from HDFS using the -clean argument:
% hadoop jar $HADOOP_INSTALL/hadoop-*-test.jar TestDFSIO -clean
Benchmarking MapReduce with Sort
Hadoop comes with a MapReduce program that does a partial sort of its input. It is very useful for benchmarking the whole MapReduce system, as the full input dataset is transferred through the shuffle. The three steps are: generate some random data, perform the sort, then validate the results.
First we generate some random data using RandomWriter. It runs a MapReduce job with 10 maps per node, and each map generates (approximately) 10 GB of random binary data, with keys and values of various sizes. You can change these values if you like by setting the properties test.randomwriter.maps_per_host and test.randomwrite.bytes_per_map. There are also settings for the size ranges of the keys and values; see RandomWriter for details.
Here’s how to invoke RandomWriter (found in the example JAR file, not the test one) to write its output to a directory called random-data:
% hadoop jar $HADOOP_INSTALL/hadoop-*-examples.jar randomwriter random-data
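For a smaller trial run, one option is to override the two RandomWriter properties as -D name=value pairs. This sketch assumes RandomWriter is launched through Hadoop's generic options handling, which expects such options before the program's own arguments; check this against your version. The helper names and the values used (2 maps per host, 100 MB per map) are illustrative only. The function prints the command rather than running it, so you can inspect it first.

```shell
# Convert a size in MB to the byte count expected by
# test.randomwrite.bytes_per_map.
mb_to_bytes() {
  echo $(( $1 * 1024 * 1024 ))
}

# Print a RandomWriter command line with both properties overridden.
# $1 = maps per host, $2 = MB per map, $3 = output directory.
randomwriter_cmd() {
  echo "hadoop jar \$HADOOP_INSTALL/hadoop-*-examples.jar randomwriter" \
       "-D test.randomwriter.maps_per_host=$1" \
       "-D test.randomwrite.bytes_per_map=$(mb_to_bytes "$2")" \
       "$3"
}

randomwriter_cmd 2 100 random-data
```

Printing the command first is a deliberate choice: a mistyped byte count can generate far more data than intended, so it is worth reading the expanded value before launching the job.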
Next we can run the Sort program:
% hadoop jar $HADOOP_INSTALL/hadoop-*-examples.jar sort random-data sorted-data
The overall execution time of the sort is the metric we are interested in, but it’s instructive to watch the job’s progress via the web UI (http://jobtracker-host:50030/), where you can get a feel for how long each phase of the job takes.
As a final sanity check, we validate that the data in sorted-data is, in fact, correctly sorted:
% hadoop jar $HADOOP_INSTALL/hadoop-*-test.jar testmapredsort -sortInput random-data \
  -sortOutput sorted-data
This command runs the SortValidator program, which performs a series of checks on the unsorted and sorted data to confirm that the sort is accurate. It reports the outcome to the console at the end of its run:
SUCCESS! Validated the MapReduce framework's 'sort' successfully.
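The three steps can be chained so that a failed step stops the run and the sort time is captured automatically. The following is a sketch under a few assumptions of our own: the function names and the HADOOP_CMD variable are hypothetical, the validator's success line is detected by searching its combined output for "SUCCESS!", and date +%s is available for whole-second timing.

```shell
# Launcher to use; override HADOOP_CMD to point at a specific installation.
HADOOP_CMD=${HADOOP_CMD:-hadoop}

# Run a command and report how many whole seconds it took.
timed() {
  start=$(date +%s)
  "$@" || return 1
  end=$(date +%s)
  echo "elapsed_secs=$(( end - start ))"
}

# Generate data, sort it (timed), then validate the output.
# $1 = examples JAR, $2 = test JAR. Returns non-zero if any step fails
# or the validator does not report success.
sort_benchmark() {
  examples_jar=$1
  test_jar=$2
  "$HADOOP_CMD" jar "$examples_jar" randomwriter random-data || return 1
  timed "$HADOOP_CMD" jar "$examples_jar" sort random-data sorted-data || return 1
  report=$("$HADOOP_CMD" jar "$test_jar" testmapredsort \
      -sortInput random-data -sortOutput sorted-data 2>&1)
  echo "$report"
  case "$report" in
    *'SUCCESS!'*) return 0 ;;
    *)            return 1 ;;
  esac
}
```

A call such as sort_benchmark $HADOOP_INSTALL/hadoop-*-examples.jar $HADOOP_INSTALL/hadoop-*-test.jar runs the whole sequence. Only the sort step is timed, since that is the metric of interest; the web UI remains the better tool for seeing how long each phase takes.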
Other benchmarks
There are many more Hadoop benchmarks, but the following are widely used:
- MRBench (invoked with mrbench) runs a small job a number of times. It acts as a good counterpoint to sort, as it checks whether small job runs are responsive.
- NNBench (invoked with nnbench) is useful for load testing namenode hardware.
- Gridmix is a suite of benchmarks designed to model a realistic cluster workload, by mimicking a variety of data-access patterns seen in practice. See src/benchmarks/gridmix2 in the distribution for further details.[63]
User Jobs
For tuning, it is best to include a few jobs that are representative of the jobs that your users run, so your cluster is tuned for these and not just for the standard benchmarks. If this is your first Hadoop cluster and you don’t have any user jobs yet, then Gridmix is a good substitute.
When running your own jobs as benchmarks, select a dataset for them and reuse it each time you run the benchmarks, so that runs can be compared. When you set up a new cluster or upgrade an existing one, you will be able to use the same dataset to compare performance with previous runs.
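One simple way to keep runs comparable is to log each timing alongside the job and dataset it came from. The sketch below is one possible approach, not a Hadoop facility: the function names, the CSV layout, and the example labels (wordcount, logs-2009, runs.csv) are all hypothetical.

```shell
# Append one result line: job label, dataset name, elapsed seconds.
# $1 = results file, $2 = job, $3 = dataset, $4 = seconds.
record_run() {
  printf '%s,%s,%s\n' "$2" "$3" "$4" >> "$1"
}

# Print the percentage change between the last two recorded runs of a job.
# $1 = results file, $2 = job label.
compare_last_two() {
  awk -F, -v job="$2" '
    $1 == job { prev = last; last = $3; n++ }
    END {
      if (n < 2) { print "need at least two runs of " job; exit 1 }
      printf "%s: %+.1f%% vs previous run\n", job, (last - prev) / prev * 100
    }' "$1"
}
```

For example, after record_run runs.csv wordcount logs-2009 200 and, following an upgrade, record_run runs.csv wordcount logs-2009 150, the call compare_last_two runs.csv wordcount prints "wordcount: -25.0% vs previous run". Storing the dataset name in each line makes it easy to spot a comparison that accidentally mixes datasets.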
[63] In a similar vein, PigMix is a set of benchmarks for Pig available from http://wiki.apache.org/pig/PigMix.
Apache Hadoop is ideal for organizations with a growing need to process massive application datasets. Hadoop: The Definitive Guide is a comprehensive resource for using Hadoop to build reliable, scalable, distributed systems. Programmers will find details for analyzing large datasets with Hadoop, and administrators will learn how to set up and run Hadoop clusters. The book includes case studies that illustrate how Hadoop is used to solve specific problems.