scala> val text=spark.read.textFile("/tmp/20171024/tian.txt")
text: org.apache.spark.sql.Dataset[String] = [value: string]
scala> text.count
res0: Long = 6
scala> val text=sc.textFile("/tmp/20171024/tian.txt")
text: org.apache.spark.rdd.RDD[String] = /tmp/20171024/tian.txt MapPartitionsRDD[7] at textFile at <console>:24
scala> text.count
res1: Long = 6
You can get values from the Dataset directly by calling some actions, or transform the Dataset to get a new one. For more details, please read the API doc.
Caching
Spark also supports pulling data sets into a cluster-wide in-memory cache. This is very useful when data is accessed repeatedly, such as when querying a small “hot” dataset or when running an iterative algorithm like PageRank. As a simple example, let’s mark our text dataset to be cached:
scala> text.cache()
res2: text.type = /tmp/20171024/tian.txt MapPartitionsRDD[7] at textFile at <console>:24
scala> text.count
res3: Long = 6
It may seem silly to use Spark to explore and cache a small text file. The interesting part is that these same functions can be used on very large data sets, even when they are striped across tens or hundreds of nodes.
val spark = SparkSession.builder.appName("Simple Application").getOrCreate()
Unlike the earlier examples with the Spark shell, which initializes its own SparkSession, we initialize a SparkSession as part of the program. We call SparkSession.builder to construct a SparkSession, then set the application name, and finally call getOrCreate to get the SparkSession instance.
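The surrounding text refers to a SimpleApp.scala source file that is not reproduced here; a minimal sketch of what it could look like, assuming the input path is a placeholder you replace with a real file on your system:

/* SimpleApp.scala */
import org.apache.spark.sql.SparkSession

object SimpleApp {
  def main(args: Array[String]): Unit = {
    val logFile = "YOUR_SPARK_HOME/README.md"  // placeholder: point this at a real text file
    val spark = SparkSession.builder.appName("Simple Application").getOrCreate()
    val logData = spark.read.textFile(logFile).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println(s"Lines with a: $numAs, Lines with b: $numBs")
    spark.stop()
  }
}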
Our application depends on the Spark API, so we’ll also include an sbt configuration file, build.sbt, which declares Spark as a dependency:
name := "Simple Project" version := "1.0" scalaVersion := "2.11.8" libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.2.0"
For sbt to work correctly, we’ll need to lay out SimpleApp.scala and build.sbt according to the typical directory structure. Once that is in place, we can create a JAR package containing the application’s code, then use the spark-submit script to run our program.
# Your directory layout should look like this
$ find .
.
./build.sbt
./src
./src/main
./src/main/scala
./src/main/scala/SimpleApp.scala
# Package a jar containing your application
$ sbt package
...
[info] Packaging {..}/{..}/target/scala-2.11/simple-project_2.11-1.0.jar
# Use spark-submit to run your application
$ YOUR_SPARK_HOME/bin/spark-submit \
--class "SimpleApp" \
--master local[4] \
target/scala-2.11/simple-project_2.11-1.0.jar
...
Lines with a: 46, Lines with b: 23
The main abstraction Spark provides is a resilient distributed dataset (RDD), which is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel.
RDDs are created by starting with a file in the Hadoop file system (or any other Hadoop-supported file system), or an existing Scala collection in the driver program, and transforming it.
Users may also ask Spark to persist an RDD in memory, allowing it to be reused efficiently across parallel operations. Finally, RDDs automatically recover from node failures.
A second abstraction in Spark is shared variables that can be used in parallel operations. By default, when Spark runs a function in parallel as a set of tasks on different nodes, it ships a copy of each variable used in the function to each task.
Sometimes, a variable needs to be shared across tasks, or between tasks and the driver program. Spark supports two types of shared variables: broadcast variables, which can be used to cache a value in memory on all nodes, and accumulators, which are variables that are only “added” to, such as counters and sums.
The first thing a Spark program must do is create a SparkContext object, which tells Spark how to access a cluster. To create a SparkContext, you first need to build a SparkConf object that contains information about your application.
Only one SparkContext may be active per JVM. You must stop() the active SparkContext before creating a new one.
The appName parameter is a name for your application to show on the cluster UI. master is a Spark, Mesos or YARN cluster URL, or a special “local” string to run in local mode.
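In application code, these two settings are typically supplied through a SparkConf; a minimal sketch (the app name and master value are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

// appName appears in the cluster UI; master selects the cluster (here, local mode with 4 threads).
val conf = new SparkConf().setAppName("My App").setMaster("local[4]")
val sc = new SparkContext(conf)
// ... run jobs with sc ...
sc.stop()  // only one SparkContext may be active per JVM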
For example, to run bin/spark-shell on exactly four cores, use:
$ ./bin/spark-shell --master local[4]
Or, to also add code.jar to its classpath, use:
$ ./bin/spark-shell --master local[4] --jars code.jar
To include a dependency using Maven coordinates:
$ ./bin/spark-shell --master local[4] --packages "org.example:example:0.1"
One important parameter for parallel collections is the number of partitions to cut the dataset into. Spark will run one task for each partition of the cluster.
Typically you want 2-4 partitions for each CPU in your cluster. Normally, Spark tries to set the number of partitions automatically based on your cluster.
However, you can also set it manually by passing it as a second parameter to parallelize (e.g. sc.parallelize(data, 10)).
Note: some places in the code use the term slices (a synonym for partitions) to maintain backward compatibility.
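A quick sketch of creating a parallelized collection with an explicit partition count (the data here is illustrative):

val data = Array(1, 2, 3, 4, 5)
// Ask for 10 partitions explicitly; otherwise Spark picks a default based on the cluster.
val distData = sc.parallelize(data, 10)
distData.reduce((a, b) => a + b)  // runs one task per partition and sums the elements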
Spark can create distributed datasets from any storage source supported by Hadoop, including your local file system, HDFS, Cassandra, HBase,Amazon S3, etc.
Spark supports text files, SequenceFiles, and any other Hadoop InputFormat.
By default, Spark creates one partition for each block of the file (blocks being 128MB by default in HDFS), but you can also ask for a higher number of partitions by passing a larger value.
Note that you cannot have fewer partitions than blocks.
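For illustration (the path and partition count are placeholders), the minimum number of partitions is passed as the optional second argument to textFile:

// Request at least 8 partitions; you cannot get fewer partitions than HDFS blocks.
val distFile = sc.textFile("hdfs:///path/to/data.txt", 8)
distFile.map(_.length).reduce(_ + _)  // total length of all lines, as a quick sanity check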
SparkContext.wholeTextFiles lets you read a directory containing multiple small text files, and returns each of them as (filename, content) pairs. This is in contrast with textFile, which would return one record per line in each file. Partitioning is determined by data locality which, in some cases, may result in too few partitions. For those cases, wholeTextFiles provides an optional second argument for controlling the minimal number of partitions.
RDDs support two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset. For example, map is a transformation that passes each dataset element through a function and returns a new RDD representing the results. On the other hand, reduce is an action that aggregates all the elements of the RDD using some function and returns the final result to the driver program (although there is also a parallel reduceByKey that returns a distributed dataset).
All transformations in Spark are lazy, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program. This design enables Spark to run more efficiently. For example, we can realize that a dataset created through map will be used in a reduce and return only the result of the reduce to the driver, rather than the larger mapped dataset.
By default, each transformed RDD may be recomputed each time you run an action on it. However, you may also persist an RDD in memory using the persist (or cache) method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it. There is also support for persisting RDDs on disk, or replicated across multiple nodes.
To illustrate RDD basics, consider the simple program below:
val lines = sc.textFile("data.txt")
val lineLengths = lines.map(s => s.length)
val totalLength = lineLengths.reduce((a, b) => a + b)
The first line defines a base RDD from an external file. This dataset is not loaded in memory or otherwise acted on: lines is merely a pointer to the file. The second line defines lineLengths as the result of a map transformation. Again, lineLengths is not immediately computed, due to laziness. Finally, we run reduce, which is an action. At this point Spark breaks the computation into tasks to run on separate machines, and each machine runs both its part of the map and a local reduction, returning only its answer to the driver program.
If we also wanted to use lineLengths again later, we could add:
lineLengths.persist()
before the reduce, which would cause lineLengths to be saved in memory after the first time it is computed.
Passing Functions to Spark
Spark’s API relies heavily on passing functions in the driver program to run on the cluster. There are two recommended ways to do this:
- Anonymous function syntax, which can be used for short pieces of code.
- Static methods in a global singleton object. For example, you can define object MyFunctions and then pass MyFunctions.func1, as follows:
object MyFunctions {
def func1(s: String): String = { ... }
}
myRdd.map(MyFunctions.func1)
Note that while it is also possible to pass a reference to a method in a class instance (as opposed to a singleton object), this requires sending the object that contains that class along with the method. For example, consider:
class MyClass {
def func1(s: String): String = { ... }
def doStuff(rdd: RDD[String]): RDD[String] = { rdd.map(func1) }
}
Here, if we create a new MyClass instance and call doStuff on it, the map inside there references the func1 method of that MyClass instance, so the whole object needs to be sent to the cluster. It is similar to writing rdd.map(x => this.func1(x)).
In a similar way, accessing fields of the outer object will reference the whole object:
class MyClass {
val field = "Hello"
def doStuff(rdd: RDD[String]): RDD[String] = { rdd.map(x => field + x) }
}
is equivalent to writing rdd.map(x => this.field + x), which references all of this. To avoid this issue, the simplest way is to copy field into a local variable instead of accessing it externally:
def doStuff(rdd: RDD[String]): RDD[String] = {
val field_ = this.field
rdd.map(x => field_ + x)
}
Consider the naive RDD element sum below, which may behave differently depending on whether execution happens within the same JVM (local mode) or on a cluster:
var counter = 0
val data = Array(1, 2, 3, 4, 5)  // example input; any local collection works here
var rdd = sc.parallelize(data)
// Wrong: Don't do this!!
rdd.foreach(x => counter += x)
println("Counter value: " + counter)
Local vs. cluster modes
The behavior of the above code is undefined, and may not work as intended. To execute jobs, Spark breaks up the processing of RDD operations into tasks, each of which is executed by an executor. Prior to execution, Spark computes the task’s closure. The closure is those variables and methods which must be visible for the executor to perform its computations on the RDD (in this case foreach()). This closure is serialized and sent to each executor.
The variables within the closure sent to each executor are now copies and thus, when counter is referenced within the foreach function, it is no longer the counter on the driver node. There is still a counter in the memory of the driver node, but it is no longer visible to the executors! The executors only see the copy from the serialized closure. Thus, the final value of counter will still be zero, since all operations on counter were referencing the value within the serialized closure.
In local mode, in some circumstances the foreach function will actually execute within the same JVM as the driver and will reference the same original counter, and may actually update it.
To ensure well-defined behavior in these sorts of scenarios one should use an Accumulator. Accumulators in Spark are used specifically to provide a mechanism for safely updating a variable when execution is split up across worker nodes in a cluster. The Accumulators section of this guide discusses these in more detail.
In general, closures (constructs like loops or locally defined methods) should not be used to mutate some global state. Spark does not define or guarantee the behavior of mutations to objects referenced from outside of closures. Some code that does this may work in local mode, but that is just by accident, and such code will not behave as expected in distributed mode. Use an Accumulator instead if some global aggregation is needed, as in the sketch below.
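For instance, the counter example above can be rewritten with a long accumulator; a minimal sketch, reusing the data collection defined in that snippet:

val safeCounter = sc.longAccumulator("safeCounter")
sc.parallelize(data).foreach(x => safeCounter.add(x))
// Reading the accumulator on the driver is well-defined, unlike the mutated local variable.
println("Counter value: " + safeCounter.value)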
When using custom objects as the key in key-value pair operations, you must be sure that a custom equals() method is accompanied by a matching hashCode() method.
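As an illustration (the class and field names here are made up), a Scala case class derives structurally consistent equals() and hashCode() automatically, which key-based operations such as reduceByKey rely on:

// Hypothetical key type; case classes get matching equals()/hashCode() for free.
case class UserKey(id: Int, country: String)

val pairs = sc.parallelize(Seq((UserKey(1, "US"), 10), (UserKey(1, "US"), 5)))
pairs.reduceByKey(_ + _).collect()  // both records collapse onto the same key: Array((UserKey(1,US),15))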
Transformations
The following table lists some of the common transformations supported by Spark. Refer to the RDD API doc (Scala, Java, Python, R) and pair RDD functions doc (Scala, Java) for details.
Actions
The following table lists some of the common actions supported by Spark. Refer to the RDD API doc (Scala, Java, Python, R) and pair RDD functions doc (Scala, Java) for details.
Action | Meaning |
---|---|
reduce(func) | Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel. |
collect() | Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data. |
count() | Return the number of elements in the dataset. |
first() | Return the first element of the dataset (similar to take(1)). |
take(n) | Return an array with the first n elements of the dataset. |
takeSample(withReplacement, num, [seed]) | Return an array with a random sample of num elements of the dataset, with or without replacement, optionally pre-specifying a random number generator seed. |
takeOrdered(n, [ordering]) | Return the first n elements of the RDD using either their natural order or a custom comparator. |
saveAsTextFile(path) | Write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system. Spark will call toString on each element to convert it to a line of text in the file. |
saveAsSequenceFile(path) (Java and Scala) | Write the elements of the dataset as a Hadoop SequenceFile in a given path in the local filesystem, HDFS or any other Hadoop-supported file system. This is available on RDDs of key-value pairs that implement Hadoop's Writable interface. In Scala, it is also available on types that are implicitly convertible to Writable (Spark includes conversions for basic types like Int, Double, String, etc). |
saveAsObjectFile(path) (Java and Scala) | Write the elements of the dataset in a simple format using Java serialization, which can then be loaded using SparkContext.objectFile(). |
countByKey() | Only available on RDDs of type (K, V). Returns a hashmap of (K, Int) pairs with the count of each key. |
foreach(func) | Run a function func on each element of the dataset. This is usually done for side effects such as updating an Accumulator or interacting with external storage systems. Note: modifying variables other than Accumulators outside of the foreach() may result in undefined behavior. See Understanding closures for more details. |
The Spark RDD API also exposes asynchronous versions of some actions, like foreachAsync for foreach, which immediately return a FutureAction to the caller instead of blocking on completion of the action. This can be used to manage or wait for the asynchronous execution of the action.
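A sketch of using the asynchronous variant from the shell (the RDD contents and the no-op function are illustrative):

import scala.concurrent.Await
import scala.concurrent.duration.Duration

// foreachAsync returns a FutureAction immediately instead of blocking the driver.
val future = sc.parallelize(1 to 1000).foreachAsync(x => ())
// ... the driver thread is free to do other work here ...
Await.result(future, Duration.Inf)  // block only when the result is actually needed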
Shuffle operations
Certain operations within Spark trigger an event known as the shuffle. The shuffle is Spark’s mechanism for re-distributing data so that it’s grouped differently across partitions. This typically involves copying data across executors and machines, making the shuffle a complex and costly operation.
Shared Variables
Spark does provide two limited types of shared variables for two common usage patterns: broadcast variables and accumulators.
Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks.
Broadcast variables are created from a variable v by calling SparkContext.broadcast(v). The broadcast variable is a wrapper around v, and its value can be accessed by calling the value method.
val broadcastVar = sc.broadcast(Array(1, 2, 3))
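Accessing the wrapped value, as a quick check:

broadcastVar.value  // Array(1, 2, 3)
// Closures should capture the broadcast wrapper, not the underlying array:
sc.parallelize(1 to 3).map(i => broadcastVar.value(i - 1)).collect()  // Array(1, 2, 3)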
Accumulators are variables that are only “added” to through an associative and commutative operation and can therefore be efficiently supported in parallel. They can be used to implement counters (as in MapReduce) or sums. Spark natively supports accumulators of numeric types, and programmers can add support for new types.
scala> val accnum=sc.longAccumulator("ggg")
scala> sc.parallelize(Array(1,2,3,4,5)).foreach(x=>accnum.add(x))
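Reading the accumulator back on the driver gives the sum of the elements, matching the value reported in the web UI below:

accnum.value  // 15 (= 1 + 2 + 3 + 4 + 5)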
Accumulators
Accumulable | Value |
---|---|
ggg | 15 |
Tasks (4)
Index | ID | Attempt | Status | Locality Level | Executor ID / Host | Launch Time | Duration | Accumulators |
---|---|---|---|---|---|---|---|---|
0 | 63 | 0 | SUCCESS | PROCESS_LOCAL | driver / localhost | 2017/11/10 17:03:33 | 1 ms | ggg: 1 |
1 | 64 | 0 | SUCCESS | PROCESS_LOCAL | driver / localhost | 2017/11/10 17:03:33 | 1 ms | ggg: 2 |
2 | 65 | 0 | SUCCESS | PROCESS_LOCAL | driver / localhost | 2017/11/10 17:03:33 | 0 ms | ggg: 3 |
3 | 66 | 0 | SUCCESS | PROCESS_LOCAL | driver / localhost | 2017/11/10 17:03:33 | 0 ms | ggg: 9 |
Deploying to a Cluster
The application submission guide describes how to submit applications to a cluster. In short, once you package your application into a JAR (for Java/Scala) or a set of .py or .zip files (for Python), the bin/spark-submit script lets you submit it to any supported cluster manager.
Launching Spark jobs from Java / Scala
The org.apache.spark.launcher package provides classes for launching Spark jobs as child processes using a simple Java API.
Unit Testing
Spark is friendly to unit testing with any popular unit test framework. Simply create a SparkContext in your test with the master URL set to local, run your operations, and then call SparkContext.stop() to tear it down. Make sure you stop the context within a finally block or the test framework’s tearDown method, as Spark does not support two contexts running concurrently in the same program.
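As a concrete illustration, a minimal test sketch, assuming ScalaTest (here, the 3.0.x FunSuite style) is on the test classpath; the suite name and data are made up:

import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.FunSuite

class LineLengthSuite extends FunSuite {
  test("sums line lengths") {
    val sc = new SparkContext(new SparkConf().setAppName("test").setMaster("local[2]"))
    try {
      val total = sc.parallelize(Seq("ab", "cde")).map(_.length).reduce(_ + _)
      assert(total == 5)
    } finally {
      sc.stop()  // tear the context down so later tests can create their own
    }
  }
}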