Introduction to Apache Spark - Part 2
Thanks for your time; I definitely try to value yours. In Part 1, we discussed the Apache Spark libraries and Spark components such as the Driver, DAG Scheduler, Task Scheduler, and Worker. In Part 2, we will cover the basic Spark concepts: Resilient Distributed Datasets, shared variables, SparkContext, transformations, and actions, along with the advantages of using Spark, examples, and when to use Spark.
RDD – Resilient Distributed Datasets:
An RDD is a collection of serializable elements. The collection may be partitioned, in which case its partitions are stored across multiple nodes.
It may reside in memory or on disk.
Spark uses RDDs to reduce I/O by keeping processed data in memory.
RDDs help tolerate node failures: when a node fails, the whole computation does not need to be restarted.
Typically, an RDD is created from a Hadoop input format or by applying a transformation to an existing RDD.
Each RDD stores its data lineage; if data is lost, Spark replays the lineage to rebuild the lost partitions.
RDDs are immutable.
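To make the lineage idea concrete, here is a minimal pure-Python sketch. It is a toy model, not Spark's API: the class `ToyRDD` and its methods `partition()` and `lose_partition()` are hypothetical names invented for this illustration. It assumes the source data is durable (as a file in HDFS would be) and that only computed partitions can be lost.

```python
class ToyRDD:
    """Toy stand-in for an RDD: immutable, partitioned, and aware of its lineage."""

    def __init__(self, partitions=None, parent=None, fn=None):
        # A source dataset holds its partitions as immutable tuples.
        # A derived dataset holds only its lineage: a parent and a function.
        self._source = None if partitions is None else tuple(tuple(p) for p in partitions)
        self._parent = parent
        self._fn = fn
        self._materialized = {}  # computed partitions kept "in memory"

    def map(self, fn):
        # Never modifies this dataset; returns a new one that remembers its parent.
        return ToyRDD(parent=self, fn=fn)

    def partition(self, index):
        # Compute the partition on demand, replaying the lineage if needed.
        if index not in self._materialized:
            if self._source is not None:
                data = self._source[index]
            else:
                data = tuple(self._fn(x) for x in self._parent.partition(index))
            self._materialized[index] = data
        return self._materialized[index]

    def lose_partition(self, index):
        # Simulate a node failure that drops one computed partition.
        self._materialized.pop(index, None)


numbers = ToyRDD(partitions=[[1, 2], [3, 4]])
doubled = numbers.map(lambda x: x * 2)

before_failure = doubled.partition(1)
doubled.lose_partition(1)              # the "node" holding partition 1 fails
after_recovery = doubled.partition(1)  # rebuilt from the parent via the lineage
print(before_failure, after_recovery)
```

Only the lost partition is recomputed; partition 0 and the parent dataset are untouched, which is the property that lets a job continue after a node failure without starting over.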
Shared Variables:
Spark has two types of variables that allow information to be shared between the execution nodes.
These are broadcast variables and accumulator variables.
Broadcast variables are sent to all remote execution nodes, similar to MapReduce Configuration objects.
Accumulators are also sent to the remote execution nodes, with the limitation that the nodes can only add to them, similar to MapReduce counters.
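The difference between the two kinds of shared variable can be sketched in plain Python. This is a toy model under stated assumptions, not Spark code: the classes `Broadcast` and `Accumulator`, the function `run_task`, and the sample data are all hypothetical. In PySpark the corresponding objects are created with `SparkContext.broadcast()` and `SparkContext.accumulator()`.

```python
class Broadcast:
    """Read-only value shipped to every task (toy model)."""

    def __init__(self, value):
        self._value = value

    @property
    def value(self):
        return self._value


class Accumulator:
    """Add-only counter: tasks may add to it, the driver reads the total (toy model)."""

    def __init__(self, initial=0):
        self._total = initial

    def add(self, amount):
        self._total += amount

    @property
    def value(self):
        return self._total


def run_task(partition, lookup, unknown):
    # A "task" reads the broadcast lookup table and counts records it cannot resolve.
    names = []
    for code in partition:
        name = lookup.value.get(code)
        if name is None:
            unknown.add(1)
        else:
            names.append(name)
    return names


country_names = Broadcast({"IN": "India", "US": "United States"})
unknown_codes = Accumulator(0)

partitions = [["IN", "XX"], ["US", "IN", "??"]]
results = [run_task(p, country_names, unknown_codes) for p in partitions]
print(results, unknown_codes.value)
```

The lookup table flows one way, from the driver to the tasks, while the counter flows the other way, from the tasks back to the driver.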
SparkContext:
SparkContext is an object that represents the connection to a Spark cluster.
It is used to create RDDs, broadcast data, and initialize accumulators.
Transformations:
Transformations are functions that take one RDD and return another.
A transformation never modifies its input; it only returns a new, modified RDD.
Transformations are always lazy: they do not compute their results immediately. Calling a transformation function only creates a new RDD.
The whole chain of transformations is executed when an action is called.
Spark has many transformations, including map(), filter(), keyBy(), join(), groupByKey(), and sortBy().
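Lazy evaluation is easier to see in a small sketch. The following is a pure-Python toy model, not Spark's implementation: `LazyDataset` is a hypothetical class whose `map()` and `filter()` only record the requested step, and whose `collect()` stands in for the call that finally runs the chain.

```python
class LazyDataset:
    """Toy model of lazy transformations: each call records a step, nothing runs yet."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = steps

    def map(self, fn):
        # Returns a new dataset; the input dataset is left unchanged.
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        return LazyDataset(self._source, self._steps + (("filter", predicate),))

    def collect(self):
        # Only here is the recorded chain of steps actually executed.
        items = iter(self._source)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)


calls = []  # records every element the map function actually touches


def square(x):
    calls.append(x)
    return x * x


numbers = LazyDataset([1, 2, 3, 4])
even_squares = numbers.map(square).filter(lambda x: x % 2 == 0)

calls_before = list(calls)       # still empty: nothing has been computed
result = even_squares.collect()  # now the whole chain runs
print(calls_before, result, calls)
```

Because the chain is only a description until it is run, the engine is free to plan the whole pipeline at once instead of materializing every intermediate result.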
Actions:
Actions are methods that take an RDD, perform a computation, and return the result to the driver application.
An action triggers the computation of the pending transformations; the result can be a collection returned to the driver, values printed to the screen, or values saved to a file.
An action never returns an RDD.
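As a rough illustration of the split between the two kinds of operation, the pure-Python sketch below uses a hypothetical `Pipeline` class (not part of Spark). Its `map()` returns another `Pipeline`, while `collect()`, `count()`, and `reduce()` are modeled on the Spark actions of the same names and return ordinary Python values.

```python
from functools import reduce as fold
from operator import add


class Pipeline:
    """Toy model: map() returns another Pipeline, actions return plain values."""

    def __init__(self, source, fn=None):
        self._source = source
        self._fn = fn  # pending per-element function, applied only by an action

    def map(self, fn):
        # Transformation: compose the function and hand back a new Pipeline.
        prev = self._fn
        composed = fn if prev is None else (lambda x: fn(prev(x)))
        return Pipeline(self._source, composed)

    def _run(self):
        # Execute the pending work; called only from the actions below.
        if self._fn is None:
            return list(self._source)
        return [self._fn(x) for x in self._source]

    def collect(self):
        return self._run()            # action: a list for the driver

    def count(self):
        return len(self._run())       # action: an int

    def reduce(self, op):
        return fold(op, self._run())  # action: a single combined value


lengths = Pipeline(["spark", "rdd", "action"]).map(len)

all_lengths = lengths.collect()
how_many = lengths.count()
total = lengths.reduce(add)
print(all_lengths, how_many, total)
```

In this toy model every action re-runs the pending work from the source, which is why keeping a reused dataset in memory can pay off.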
Advantages of using Spark:
- Reduced disk I/O
- Resource manager independence
- Interactive shell (REPL)
Like other big data tools, Spark is powerful, capable, and well suited to tackling a range of analytics and big data challenges.