Hadoop TeraSort Benchmark

TeraSort is one of Hadoop’s widely used benchmarks. Hadoop’s distribution contains both the input generator and sorting implementations: the TeraGen generates the input and TeraSort conducts the sorting. Here, we provide a short tutorial for using the Hadoop TeraSort benchmark.

TeraGen generates random data that can be used as input data for a subsequent running of TeraSort.

Generate input by TeraGen

Table of Contents

The syntax for TeraGen:

$ hadoop jar hadoop-*examples*.jar teragen \
<number of 100-byte rows> <output dir>

To make the TeraGen run on multiple nodes with multiple tasks, you may need to specify the number of map tasks (30 here as an example; for Hadoop 2):

$ hadoop -D mapreduce.job.maps 30 \
jar hadoop-*examples*.jar teragen \
<number of 100-byte rows> <output dir>

The number of mappers depends on the number of rows you will generate and the number of nodes you have. For more information on how to set the number of mappers and reducers, please check this post.

Run TeraSort

After the data is generated, run the sort by TeraSort

$ hadoop jar hadoop-*examples*.jar terasort \
<input dir> <output dir>

You may also need to set the number of mappers and reducers for better performance.

Validate the sorted output data of TeraSort

TeraValidate ensures that the output data of TeraSort is globally sorted.

The syntax for TeraValidate:

$ hadoop jar hadoop-*examples*.jar teravalidate \
<output dir> <terasort-validate dir>

One comment:

Eric Zhiqiang Ma says:

Jul 23, 2014 at 6:34 pm

For large datasets, you may need to specify the number of mappers and reducers to make the computation and data distributed across nodes:

https://www.systutorials.com/qa/947/how-set-the-number-mappers-and-reducers-hadoop-command-line

Generate input by TeraGen

Run TeraSort

Validate the sorted output data of TeraSort

Eric Ma

One comment: