Spark And Hadoop Are Friends, Not Foes
As a long-time big data practitioner, an early advocate for investment in Hadoop by Yahoo! and now CEO of a company that provides big data as a service for the enterprise, I’d like to bring some perspective and clarity to this conversation.
Spark and Hadoop work together.
Hadoop is increasingly the enterprise platform of choice for big data. Spark is an in-memory processing solution that runs on top of Hadoop. The largest users of Hadoop — including eBay and Yahoo! — both run Spark inside their Hadoop clusters. Cloudera and Hortonworks ship Spark as part of their Hadoop distributions. And our own customers here at Altiscale have been using Spark on Hadoop since we launched.
To position Spark in opposition to Hadoop is like saying that your new electric car is so cool that you won’t need electricity anymore. If anything, electric cars will drive demand for more electricity.
YARN can host any number of programming frameworks. The original such framework was MapReduce, invented at Google to help process massive web crawls. Spark is another such framework, as is another new one called Tez. When people talk about Spark “crushing” Hadoop, what they really mean is that programmers now prefer using Spark to the older MapReduce framework.
However, MapReduce should not be equated with Hadoop. MapReduce is just one of many ways to process your data in a Hadoop cluster. Spark can be used as an alternative. Looking more broadly, business analysts — a growing base of big data practitioners — avoid both of these frameworks, which are low-level toolkits meant for programmers. Instead, they use high-level languages like SQL that make Hadoop more accessible.
Spark isn’t a challenger that’s going to replace Hadoop. Rather, Hadoop is a foundation that makes Spark possible. We expect to see increasing adoption of both as organizations seek the broadest and most robust platform possible for turning their data assets into actionable business insight.