Apache Spark

Context

If the data becomes too large to fit into memory, R/Python/Matlab won’t allow you to scale (small data sets can be handled very easily, though).
Spark’s API maps almost one-to-one onto Scala collections.
Spark is more expressive (map, flatMap, filter, etc.) than Hadoop’s MapReduce.
Performant in terms of both running time and developer productivity.
Enables iteration (hard with Hadoop).

Shared-memory data parallelism (single machine):
Split the data.
Workers/threads independently operate on the data shards in parallel.
Combine when done (if necessary).
Scala’s parallel collections are a collections abstraction over shared-memory data-parallel execution.
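
The split / operate / combine steps above can be sketched with nothing but the Scala standard library. This is a minimal sketch that uses Futures on the default thread pool as a stand-in for parallel collections; the names ShardedSum, parallelSum and shardSize are made up for illustration.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object ShardedSum {
  // Hypothetical helper: split the data into shards, sum each shard as its
  // own task on the shared thread pool, then combine the partial sums.
  // shardSize must be greater than zero.
  def parallelSum(data: Vector[Int], shardSize: Int): Long = {
    val shards: List[Vector[Int]] = data.grouped(shardSize).toList     // split
    val partials: List[Future[Long]] =
      shards.map(shard => Future(shard.foldLeft(0L)(_ + _)))           // operate
    val combined: Future[Long] = Future.sequence(partials).map(_.sum)  // combine
    Await.result(combined, 10.seconds)
  }

  def main(args: Array[String]): Unit = {
    val data = (1 to 1000).toVector
    println(parallelSum(data, shardSize = 100)) // 500500
  }
}
```

All shards live in the memory of one process here, so combining costs no network traffic.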

Distributed data parallelism (several machines):
Split the data over several nodes.
Nodes independently operate on the data shards in parallel.
Combine when done (if necessary).
We now have to worry about network latency between workers/nodes.
Note: a non-associative reduction can give non-deterministic, inconsistent, or incorrect results under parallel execution.
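
To see why associativity matters for a parallel reduction, here is a small single-machine sketch. It reduces each shard separately and then reduces the partial results, which is roughly the shape of a parallel or distributed reduce. shardedReduce is a hypothetical helper, not a Spark or Scala library function.

```scala
object NonAssociativeReduce {
  // Reduce every shard on its own, then reduce the partial results.
  def shardedReduce(data: List[Int], shardSize: Int)(op: (Int, Int) => Int): Int =
    data.grouped(shardSize).map(_.reduce(op)).reduce(op)

  def main(args: Array[String]): Unit = {
    val data = List(1, 2, 3, 4, 5, 6, 7, 8)

    // Addition is associative: every way of sharding gives the same answer.
    println(shardedReduce(data, 2)(_ + _)) // 36
    println(shardedReduce(data, 4)(_ + _)) // 36

    // Subtraction is not: the answer depends on how the data was split.
    println(data.reduceLeft(_ - _))        // -34
    println(shardedReduce(data, 2)(_ - _)) // 2
    println(shardedReduce(data, 4)(_ - _)) // 8
  }
}
```

In a real cluster the sharding and the order in which partial results arrive are not under your control, so a non-associative operator can return a different value on every run.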

RDD (Resilient Distributed Dataset): the distributed counterpart of Scala’s parallel collections.

Reading from main memory (RAM) is roughly 100x faster than reading from disk:
- Read 1 MB from main memory -> 250 µs
- Read 1 MB from disk -> 20 ms (20,000 µs)
A main memory reference is roughly 1,000,000 (1M) times faster than a network round trip US -> Europe -> US:
- Main memory reference -> 100 ns
- Send packet US -> Europe -> US -> 150 ms (150,000,000 ns)
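
The ratios can be checked directly from the figures quoted above. A small worked calculation follows; the object and value names are made up for illustration, and the numbers are only the ones given in these notes.

```scala
object LatencyRatios {
  // Figures from the notes, all converted to nanoseconds.
  val memRead1MbNs: Long   = 250L * 1000        // 250 microseconds
  val diskRead1MbNs: Long  = 20L * 1000 * 1000  // 20 milliseconds
  val memReferenceNs: Long = 100L               // 100 nanoseconds
  val roundTripNs: Long    = 150L * 1000 * 1000 // 150 milliseconds

  // How many times slower the second option is than the first.
  val diskVsMemory: Long    = diskRead1MbNs / memRead1MbNs
  val networkVsMemory: Long = roundTripNs / memReferenceNs

  def main(args: Array[String]): Unit = {
    println(diskVsMemory)    // 80
    println(networkVsMemory) // 1500000
  }
}
```

So with these figures "~100x" is really 80x and "~1M times" is 1.5M times: same orders of magnitude, which is all the rule of thumb claims.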

Simple API -> map and reduce operations.
Fault tolerance = the ability to recover from node failures -> any machine could be expected to fail in that time, so being able to operate on data at a previously unthinkable scale without worrying about node failures is a HUGE PLUS!

Hadoop/MapReduce
- Fault tolerance in Hadoop/MapReduce comes at a cost!
  - Between each map and reduce stage it shuffles the data and writes intermediate data to disk (to recover from potential failures).
- => A lot of network and disk I/O!
Spark
- Retains fault tolerance but…
- Uses ideas from functional programming (FP) -> keeps data “in memory” and “immutable”.
- All operations are functional transformations (like on regular Scala collections).
- Fault tolerance is achieved by replaying the functional transformations on the original dataset, as often as needed (without worrying about side effects).
- => Up to 100x performance improvement over Hadoop!
- => Even more expressive APIs than Hadoop!
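
The "replay the transformations" idea can be illustrated with a toy model in plain Scala. This is not how Spark is implemented and uses none of its API; Lineage, compute and the cache map are hypothetical names, and the only assumptions are an immutable source split into partitions plus a recorded chain of pure functions.

```scala
object LineageReplay {
  // Toy stand-in for an RDD: immutable source partitions plus the ordered
  // list of pure functions that have been applied to them.
  final case class Lineage(source: Vector[Vector[Int]], steps: List[Int => Int]) {
    // Record one more transformation; the source is never modified.
    def map(f: Int => Int): Lineage = copy(steps = steps :+ f)

    // Rebuild one partition from the source by replaying every step in order.
    def compute(partition: Int): Vector[Int] =
      steps.foldLeft(source(partition))((data, f) => data.map(f))
  }

  def main(args: Array[String]): Unit = {
    val lineage = Lineage(Vector(Vector(1, 2), Vector(3, 4)), Nil)
      .map(_ + 1)
      .map(_ * 10)

    // Pretend each computed partition is held in memory by a different worker.
    var cache: Map[Int, Vector[Int]] =
      Map(0 -> lineage.compute(0), 1 -> lineage.compute(1))

    cache -= 1 // simulate losing the worker that held partition 1

    // Recovery: nothing was written to disk, so replay the steps instead.
    val recovered = cache.getOrElse(1, lineage.compute(1))
    println(recovered) // Vector(40, 50)
  }
}
```

Because the steps are pure functions over immutable data, replaying them always reproduces the lost partition exactly.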

RDDs can be created in two ways:
Transforming an existing RDD.
From a SparkContext (or SparkSession) object:
- parallelize -> converts a local Scala collection to an RDD.
- textFile -> reads a text file from HDFS or a local file system into an RDD[String].

Transformations: return new RDDs as results.
- Are lazy!
Actions: compute a result based on an RDD; the result is either returned or saved to an external storage system (e.g. HDFS).
- Are eager!
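
Since the Spark API mirrors Scala collections, the lazy/eager split can be previewed with a standard-library view. This is only an analogy in plain Scala and not Spark code: the view plays the role of a transformation, and sum plays the role of an action. LazyVsEager and demo are hypothetical names.

```scala
object LazyVsEager {
  // Returns (calls before the "action", result, calls after the "action").
  def demo(): (Int, Int, Int) = {
    var calls = 0
    val numbers = List(1, 2, 3, 4)

    // "Transformation": the view only records the mapping; nothing runs yet.
    val doubled = numbers.view.map { n => calls += 1; n * 2 }
    val callsBefore = calls

    // "Action": asking for a concrete result forces the computation.
    val total = doubled.sum
    (callsBefore, total, calls)
  }

  def main(args: Array[String]): Unit =
    println(demo()) // (0,20,4)
}
```

The mapping function is not called at all until the result is requested, which is the behaviour to keep in mind when a chain of Spark transformations appears to "finish" instantly.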