Hadoop Benchmark: Cloudera vs. Hortonworks vs. MapR

Alex Khizhniak

Evaluating Hadoop distributions across 7 workloads

Cloudera, Hortonworks, and MapR are the most popular Hadoop distributions available today. However, even with this short list, there are few unbiased comparisons of their cluster performance. So, today we’re introducing a 65-page research paper that contains a vendor-independent overview of Cloudera, Hortonworks, and MapR distributions.

Vladimir Starostenkov of Altoros compared the throughput of 8-, 12-, and 16-node clusters against that of a 4-node cluster. (The data processing speed of each larger cluster was divided by the throughput of the 4-node cluster.) The results were quite unexpected.
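
To make the normalization concrete, here is a minimal Python sketch of the division described above. The function name and the throughput values are assumptions made up for illustration; they are not taken from the research paper.

```python
# Sketch of the normalization described above. All numbers are made up for
# illustration; they are NOT results from the benchmark.

def relative_throughput(throughput_by_nodes, baseline_nodes=4):
    """Divide each cluster's throughput by the baseline cluster's throughput."""
    baseline = throughput_by_nodes[baseline_nodes]
    return {nodes: value / baseline for nodes, value in throughput_by_nodes.items()}

# Hypothetical throughput readings (arbitrary units), keyed by cluster size.
measured = {4: 100.0, 8: 200.0, 12: 180.0, 16: 160.0}

relative = relative_throughput(measured)
for nodes in sorted(relative):
    print(f"{nodes:>2} nodes: {relative[nodes]:.2f}x the 4-node throughput")
```

With perfectly linear scaling, an 8-node cluster would score 2.0, a 12-node cluster 3.0, and a 16-node cluster 4.0 on this scale, so any value below that signals less-than-linear scaling.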

Hadoop cluster performance: bigger doesn’t mean faster

In a recent interview with TechTarget, our R&D Engineer Dmitriy Kalyada explained why adding nodes to a Hadoop cluster does not always result in better performance. The new benchmark of Hadoop distributions confirms this behavior under several workloads.

For instance, when sorting unstructured text data (the Sort workload), the performance of a MapR cluster grew linearly as we increased its size from 4 to 8 nodes. When more machines were added after that, the throughput of each individual node degraded.

[Diagram: MapR performance]

As you can see in the diagram, an 8-node cluster turned out to be faster than clusters of 12 and 16 nodes. The same situation was observed in the DFSIO write test. Other Hadoop distributions showed similar results under some of the workloads, too.
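
One way to spot this pattern in your own measurements is to look at throughput per node alongside total throughput. The Python sketch below assumes hypothetical helper names and made-up numbers chosen only to mimic the shape described above; they are not the values from the research paper.

```python
# Sketch: per-node throughput and the fastest cluster size.
# The readings below are invented for illustration only.

def per_node_throughput(throughput_by_nodes):
    """Total cluster throughput divided by the number of nodes."""
    return {nodes: total / nodes for nodes, total in throughput_by_nodes.items()}

def fastest_cluster(throughput_by_nodes):
    """Cluster size with the highest total throughput."""
    return max(throughput_by_nodes, key=throughput_by_nodes.get)

# Hypothetical total throughput (arbitrary units), keyed by cluster size.
sort_throughput = {4: 100.0, 8: 200.0, 12: 180.0, 16: 160.0}

per_node = per_node_throughput(sort_throughput)
best = fastest_cluster(sort_throughput)

for nodes in sorted(per_node):
    print(f"{nodes:>2} nodes: {per_node[nodes]:.1f} per node")
print(f"Fastest cluster size: {best} nodes")
```

With these invented values, per-node throughput stays flat from 4 to 8 nodes and then drops, so the 8-node cluster ends up with the highest total throughput.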

Download the benchmark to see all the performance results (83 diagrams, 7 types of workloads), including:

  • detailed performance results for 4-, 8-, 12-, and 16-node clusters
  • how the size of a cluster affects data processing speed
  • how different clusters behave under CPU- and disk-bound workloads (including Bayes, DFSIO, Hive aggregation, PageRank, Sort, TeraSort, and WordCount)
  • what issues slow down deployment and how to maximize Hadoop processing speed

Get your copy of “Hadoop Distributions: Cloudera vs. Hortonworks vs. MapR” and let us know what you think about these results.

4 Comments
  • LaurentH

    Hello,
    Thanks for publishing your work. I am planning to benchmark cloudera vs hortonworks as well. Here are a few remarks about your publication :
    §4.1 : title is “Overall cluster performance”, but it is actually mapr’s result (which is title of §4.4)

    “These values demonstrate cluster scalability in each of the tests.”
    Performance is linear on only 2/7 tests, even worse on 8 nodes vs 4 nodes for 4/7 tests (fig.2). I guess it shows pretty much the opposite.

    For the next figures : Is the performance scale common to all 3 distros ?

    §4.2 :
    Fig 4 through 7 : legend “number of nodes” on y axis is misleading, as the number of nodes is the 4/8/12/16 (colored) dimension, with legend in bottom

    §4.3 :
    “the difference in performance between the distributions was within the limit of an experimental error”

    which error, and how would it have affected the results ?

    Fig.11 through 25, 80 through 83 :
    the different cluster configuration (3 times more disk) between mapr and cloudera/hw cluster configuration is not noted anywhere near the illustrations (in particular, dfsio tests fig 16 and 17), which could be very misleading. Unless you normalized it in some way that you didn’t mention.

    Fig.45,47,49,51,53,55,57,58bis(should be 59), 61 : MapR instead of Hortonworks

    Overall, the cluster seems to be too I/O bound to show scalability. A general observation would be that performance linearity is far from being obvious, and that the node/cluster should be carefully designed if this goal is to be achieved.

    Do you have any test result of Impala vs Hortonworks Stinger-boosted Hive (Apache Drill set aside because, as you say, it is still in early stage) ?

    Regards.

    • Vladimir Starostenkov

      Hi LaurentH,

      Thank you very much for downloading this investigation and making so many useful comments.

      “These values demonstrate cluster scalability in each of the tests. Performance is linear on only 2/7 tests, even worse on 8 nodes vs 4 nodes for 4/7 tests (fig.2). I guess it shows pretty much the opposite.”

      By “cluster scalability” I meant the ability of the system to scale under the given tests. Under some tests performance increased when more nodes were added; under others, it did not. The idea was to show that linear scalability is not always achieved by simply adding more nodes.

      “For the next figures : Is the performance scale common to all 3 distros?”

      No, it is not. It is very close to being common for HDP and CDH, but it is different for MapR. Three disks were used for MapR and one disk for HDP and CDH, so it cannot be stated that they were tested under exactly the same conditions. See Appendix E for a comparison of all three distributions.

      “§4.3 :
      “the difference in performance between the distributions was within the limit of an experimental error” which error, and how would it have affected the results?”

      Without going deep into technical details, this is the experimental error of the measurements, which is rather small: the difference between measured values cannot exceed 5%.

      “Fig.11 through 25, 80 through 83 :
      the different cluster configuration (3 times more disk) between mapr and cloudera/hw cluster configuration is not noted anywhere near the illustrations (in particular, dfsio tests fig 16 and 17), which could be very misleading. Unless you normalized it in some way that you didn’t mention.”

      It was mentioned at the very beginning of the Results section that MapR had three disks. See Appendix E for the difference in performance.

      “Do you have any test result of Impala vs Hortonworks Stinger-boosted Hive (Apache Drill set aside because, as you say, it is still in early stage)?”

      You can start with this research: https://amplab.cs.berkeley.edu/benchmark/. It looks great even though Stinger is not yet represented.

      We’ve updated the titles of the diagrams; an updated version of the white paper will be available for download tomorrow. Once again, thanks for your comments!

  • mertez

    Link appears to be broken. Is it possible to fix it?

  • Alex Khizhnyak

    Hi, mertez, which of the links?.. try this one: http://www.altoros.com/hadoop_benchmark.html or http://www.altoros.com/hadoop_benchmark

