Hadoop 0.20.1 (comparison of the grid/cloud computing frameworks - part I)
Test environment
Hadoop 0.20.1 differs from other frameworks tested in the comparison, because it divides the problem by itself. We used mapred-site.xml with the following settings:
- dfs.replication = 1
- mapred.map.tasks = 50
- mapred.reduce.tasks = 80
- mapred.tasktracker.reduce.tasks.maximum = 8
This allowed us to fully utilize available processing units. Hadoop is a centralized framework (with a special master nodes), so we also had to run additional processes on one of the machines. You can see the architecture of the test environment on the following figure:
Code
Hadoop operates on the computational units named jobs. Jobs uses special classes called Mappers and Reducers. We used the Mapper to divide the problem into smaller tasks:
public static class CMBFMapper extends Mapper<IntWritable, IntWritable, Text, FastBigInt128> { @Override public void map(IntWritable i, IntWritable lvl, Context context) throws IOException, InterruptedException { for ( final String[] imageDesc : new Worker().generateImages(i.get(), lvl.get())) { StringBuilder buffer = new StringBuilder(imageDesc[0]); for (int j = 1; j < imageDesc.length; ++j) { buffer.append("^"); buffer.append(imageDesc[j]); } context.write(new Text(buffer.toString()), new FastBigInt128()); } } }
After that we had to compute results in the Reducer:
public static class CMBFReducer extends Reducer<Text, FastBigInt128, WritableComparable<?>, Writable> { @Override public void reduce(Text imageDesc, Iterable<FastBigInt128> values, Context context) throws IOException, InterruptedException { FastBigInt128 res = new Worker().countInImage(imageDesc.toString().split("\\^")); context.write(imageDesc, res); } }
You can find all the above code in our code repository: http://dacframe.org/lab
Results
You can see all the results with std deviation and average values on the following table:
Hadoop 0.20.1 (r80) <-->| Average | 467 042.70 | 384 331.60 | 365 660.40 |
| Std Deviation | 4 037.70 | 5 629.01 | 2 001.05 |
| Tasks: 341 | Tasks: 2705 | Tasks: 33700 | |
|---|---|---|---|
| 472 149 | 388 194 | 363 537 | |
| 466 087 | 376 008 | 368 091 | |
| 468 763 | 391 167 | 364 762 | |
| 463 126 | 382 037 | 363 380 | |
| 472 088 | 379 351 | 367 917 | |
| 469 624 | 382 037 | 366 299 | |
| 465 888 | 382 137 | 364 032 | |
| 459 676 | 380 012 | 367 296 | |
| 463 948 | 390 333 | 367 741 | |
| 469 078 | 392 040 | 363 549 |
Just for fun we repeated the Hadoop tests with increased property 'mapred.reduce.tasks' (to 340, 2700 and 33700). You can see the results on the following table:
Hadoop 0.20.1 (r340, r2700, r33700) <-->| Average | 422 656.30 | 1 664 799.30 | 20 368 147.80 |
| Std Deviation | 12 583.22 | 29 127.12 | 3 107.74 |
| Tasks: 341 | Tasks: 2705 | Tasks: 33700 | |
|---|---|---|---|
| 415 299 | 1 655 003 | 20 369 962 | |
| 456 019 | 1 747 665 | 20 361 058 | |
| 421 376 | 1 656 324 | 20 368 348 | |
| 417 498 | 1 656 402 | 20 366 900 | |
| 418 430 | 1 654 101 | 20 367 456 | |
| 430 729 | 1 656 215 | 20 370 007 | |
| 415 381 | 1 656 499 | 20 368 083 | |
| 415 151 | 1 654 693 | 20 370 457 | |
| 418 479 | 1 655 648 | 20 366 582 | |
| 418 201 | 1 655 443 | 20 372 625 |
As you can see on the first table, std deviation is quite big for 341 and 2705 tasks test cases. This means, that some sort of task pre-fetching took place and the load balancer could be improved. However, significant growth of number of tasks to compute (from 341 to 33700) decreased the time of computation! This means, that the number of key/value pairs doesn't matter for Hadoop. Only the number of map/reduce tasks is important (as you can see on the second table).
CPU
CPU usage (%user and %system) gathered on the intel1 machine:
Memory
Memory usage gathered on the intel1 machine:
Network
Network usage (received and transmitted bytes/s) gathered on the intel1 machine:
Attachments
-
Hadoop-test-env.png
(14.5 KB) - added by jeremian
2 years ago.
Hadoop-test-env
-
Hadoop_0_20_1-cpu.png
(32.5 KB) - added by klider
2 years ago.
Hadoop-cpu
-
Hadoop_0_20_1-memory.png
(29.0 KB) - added by klider
2 years ago.
Hadoop-memory
-
Hadoop_0_20_1-network.png
(42.0 KB) - added by klider
2 years ago.
Hadoop-network



