Spark or Hadoop: Which Course Should You Take?

Preetham Varma

The Spark vs. Hadoop debate has picked up momentum of late, and aspirants to a Big Data career can find it confusing to decide which framework they should focus on.

Both Hadoop and Spark are Big Data frameworks, and both offer an array of tools for carrying out complex Big Data tasks. They share many use cases; however, Hadoop and Spark are not directly comparable products.

One good way of approaching this question, therefore, is to look at the similarities, differences and relative strengths of the two frameworks.

Where Hadoop is Better than Spark

  • Unlike Hadoop, Spark does not provide its own distributed storage system. Distributed storage allows vast multi-petabyte datasets to be stored across a very large number of computer hard drives, which removes the need for costly custom machinery that would hold everything on one device. Distributed storage systems are scalable: more drives can be added to the network as the dataset grows.
  • Spark is a younger framework, and its security and support infrastructure is not as advanced as that of Hadoop systems.
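
To make the idea of distributed storage concrete, here is a minimal sketch in plain Python. It is a toy model, not HDFS code: the function name `place_blocks`, the node names, and the round-robin placement rule are all assumptions made for illustration. It splits a dataset into fixed-size blocks and assigns each block to more than one node, so that adding a node simply widens the pool the blocks are spread over.

```python
from typing import Dict, List


def place_blocks(data: bytes, block_size: int, nodes: List[str],
                 replication: int = 2) -> Dict[int, List[str]]:
    """Split data into blocks and assign each block to `replication` nodes.

    Hypothetical round-robin placement, for illustration only.
    """
    if replication > len(nodes):
        raise ValueError("replication factor cannot exceed number of nodes")
    # Number of fixed-size blocks needed to hold the data (ceiling division).
    block_count = (len(data) + block_size - 1) // block_size
    placement = {}
    for index in range(block_count):
        # Consecutive nodes (wrapping around) hold the copies of this block.
        placement[index] = [nodes[(index + r) % len(nodes)]
                            for r in range(replication)]
    return placement


small_cluster = place_blocks(b"x" * 10, block_size=4,
                             nodes=["node1", "node2", "node3"])
larger_cluster = place_blocks(b"x" * 10, block_size=4,
                              nodes=["node1", "node2", "node3", "node4"])
```

In this sketch the same 10 bytes occupy three blocks in both cases; the larger cluster just has one more drive available to take a share of the copies.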

Where Spark is Better than Hadoop

Spark is reported to work up to 100 times faster than Hadoop MapReduce in certain circumstances. The benchmark cited below compares the performance of Spark and Hadoop MapReduce in sorting 100 TB of data on disk, without using Spark's in-memory operations.

Source: https://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html

Source: https://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf

  • In contrast to Hadoop's MapReduce-based system, Spark organizes data into what are known as Resilient Distributed Datasets (RDDs), which can be recovered following a failure.
  • Spark is capable of real-time stream processing, meaning that data can be fed into an analytical application the moment it is captured. Insights can then be shared with end users immediately for further action, for example in recommendation engines or in performance monitoring of industrial machinery.
  • Owing to its speed and its ability to handle streaming data, Spark is also well suited to machine learning, which builds algorithms that learn from data through a process of statistical modeling and simulation.
  • Spark ships with its own machine learning library, MLlib, whereas Hadoop systems depend on third-party machine learning libraries such as Apache Mahout.
  • Spark runs on Hadoop, Mesos, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, S3.
  • Spark SQL is compatible with Apache Hive, and Hive-style queries are reported to run up to 100x faster on Spark than on Hive.
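
The recovery property of Resilient Distributed Datasets mentioned in the list above can be sketched in a few lines of plain Python. This is a toy model under stated assumptions, not Spark's API: the class `ToyRDD` and its methods are hypothetical names. The idea it illustrates is that the dataset remembers the sequence of transformations that produced it, so a partition lost from memory can be recomputed from the original input rather than restored from a copy.

```python
class ToyRDD:
    """Toy stand-in for an RDD: input partitions plus the recipe to rebuild them."""

    def __init__(self, source_partitions, transforms=()):
        self._source = source_partitions      # durable input partitions
        self._transforms = tuple(transforms)  # recorded steps, applied in order
        self._cache = {}                      # computed partitions held in memory

    def map(self, fn):
        # Lazy: only record the step; nothing is computed yet.
        return ToyRDD(self._source, self._transforms + (fn,))

    def compute(self, index):
        # Rebuild the partition from its source if it is not in memory.
        if index not in self._cache:
            rows = self._source[index]
            for fn in self._transforms:
                rows = [fn(row) for row in rows]
            self._cache[index] = rows
        return self._cache[index]

    def lose_partition(self, index):
        # Simulate a worker failure that drops one in-memory partition.
        self._cache.pop(index, None)

    def collect(self):
        return [row for i in range(len(self._source)) for row in self.compute(i)]


rdd = ToyRDD([[1, 2], [3, 4]]).map(lambda x: x * 10).map(lambda x: x + 1)
before_failure = rdd.collect()
rdd.lose_partition(1)             # the second partition disappears from memory
recovered = rdd.compute(1)        # recomputed from the source and recorded steps
```

Only the lost partition is recomputed here; the partition that survived is served from memory as before.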

Complementarity between Hadoop & Spark

Spark’s core limitation is that it doesn’t have its own distributed storage and depends on a third-party solution for this purpose. This in fact creates the ground for the two frameworks to come together. Many Big Data practitioners now install Spark on top of Hadoop, allowing Spark’s advanced analytics applications to run on data stored in the Hadoop Distributed File System (HDFS). As a result, Spark is emerging as an alternative to Hadoop MapReduce rather than a replacement for Hadoop.

For aspirants to a Big Data career, the question that arises is whether to go for Hadoop or Spark, and whether one can jump directly to Spark, bypassing Hadoop. The answer is that you don't need to learn Hadoop to learn Spark. Spark can run on top of HDFS along with other Hadoop components. Spark has become another data processing engine in the Hadoop ecosystem, which is good for businesses and the community alike because it adds capability to the Hadoop stack. For developers, there is almost no overlap between the two. Hadoop is a framework in which one writes MapReduce jobs by inheriting from Java classes. Spark is a library that enables parallel computation via function calls.
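
The difference between the two programming styles can be sketched with a word count written both ways. The code below is plain, single-machine Python and uses neither Hadoop's nor Spark's actual API; the base classes `Mapper` and `Reducer`, the driver `run_job`, and the function `word_count` are hypothetical names chosen for illustration. The first style mimics a framework that asks you to subclass its classes, the second mimics a library you call with ordinary functions.

```python
from collections import Counter, defaultdict
from itertools import chain


# Style 1: a framework that expects you to inherit from its classes.
class Mapper:
    def map(self, line):
        raise NotImplementedError


class Reducer:
    def reduce(self, key, values):
        raise NotImplementedError


class WordCountMapper(Mapper):
    def map(self, line):
        # Emit a (word, 1) pair for every word in the line.
        for word in line.lower().split():
            yield word, 1


class WordCountReducer(Reducer):
    def reduce(self, key, values):
        return key, sum(values)


def run_job(lines, mapper, reducer):
    # Group mapper output by key, then hand each group to the reducer.
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper.map(line):
            groups[key].append(value)
    return dict(reducer.reduce(key, values) for key, values in groups.items())


# Style 2: a library used through plain function calls.
def word_count(lines):
    words = chain.from_iterable(line.lower().split() for line in lines)
    return dict(Counter(words))


lines = ["Spark and Hadoop", "spark on hadoop"]
class_based = run_job(lines, WordCountMapper(), WordCountReducer())
function_based = word_count(lines)
```

Both versions produce the same counts; what differs is how much scaffolding the developer writes around the actual logic.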
