1. Spark Core: Characteristics of the RDD
RDD (Resilient Distributed Dataset)
Spark source code: https://github.com/apache/spark
abstract class RDD[T: ClassTag](
    @transient private var _sc: SparkContext,
    @transient private var deps: Seq[Dependency[_]]
  ) extends Serializable with Logging
1. RDD is an abstract class: it cannot be used directly, and only becomes usable once a subclass has implemented its abstract methods.
2. It is generic (the type parameter T), so it can hold elements of many types: String, Person, User, and so on; see the sketch below.
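A minimal spark-shell sketch of both points, assuming a live SparkContext named sc (as spark-shell provides); parallelize happens to return ParallelCollectionRDD, one of the concrete subclasses:

import org.apache.spark.rdd.RDD

val strings: RDD[String] = sc.parallelize(Seq("a", "b", "c"))
println(strings.getClass.getSimpleName)          // ParallelCollectionRDD: always a concrete subclass, never RDD itself
val lengths: RDD[Int] = strings.map(_.length)    // the type parameter T follows the element type through transformations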
RDD: Resilient Distributed Dataset. The scaladoc describes it as "an immutable, partitioned collection of elements that can be operated on in parallel":
- immutable: transformations never modify an existing RDD; they return a new one
- partitioned: the elements are split across partitions
- in parallel: each partition can be computed by its own task
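A minimal sketch of these three traits, again assuming a SparkContext named sc:

val nums = sc.parallelize(1 to 10, 4)   // partitioned: the elements are spread across 4 partitions
val doubled = nums.map(_ * 2)           // immutable: map returns a new RDD; nums itself never changes
doubled.collect()                       // parallel: each of the 4 partitions is computed by its own task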
Internally, each RDD is characterized by five main properties:
- A list of partitions
- A function for computing each split
- A list of dependencies on other RDDs (the lineage, e.g. rdd1 => rdd2 => rdd3)
- Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
- Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file): the scheduler prefers to send the task to the node where the data already lives, since moving computation is cheaper than moving data
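All five properties can be inspected through the public RDD API. A minimal sketch, assuming a SparkContext named sc and a hypothetical HDFS path:

val rdd = sc.textFile("hdfs:///path/to/file")   // hypothetical input path
rdd.partitions.length                       // property 1: the list of partitions
rdd.dependencies                            // property 3: the parents in the lineage
rdd.partitioner                             // property 4: None here, since this RDD is not key-value partitioned
rdd.preferredLocations(rdd.partitions(0))   // property 5: e.g. the HDFS block locations of the first split
// Property 2 (compute) only runs when an action such as rdd.count() is triggered.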
How the five properties appear in the source (RDD.scala):
def compute(split: Partition, context: TaskContext): Iterator[T]          // property 2
protected def getPartitions: Array[Partition]                             // property 1
protected def getDependencies: Seq[Dependency[_]] = deps                  // property 3
protected def getPreferredLocations(split: Partition): Seq[String] = Nil  // property 5
val partitioner: Option[Partitioner] = None                               // property 4
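To see how a subclass fills these in, here is a minimal sketch of a custom RDD; SimpleRangeRDD and RangePartition are made-up names for illustration, not Spark classes. It yields the integers [0, n) across numSlices partitions: properties 1 and 2 are implemented, Nil is passed as deps for property 3, and properties 4 and 5 keep their defaults.

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical partition type: each split owns the half-open range [start, end).
class RangePartition(val index: Int, val start: Int, val end: Int) extends Partition

class SimpleRangeRDD(sc: SparkContext, n: Int, numSlices: Int)
  extends RDD[Int](sc, Nil) {  // Nil: no parent RDDs, so getDependencies (property 3) is empty

  // Property 1: the list of partitions.
  override protected def getPartitions: Array[Partition] =
    (0 until numSlices).map { i =>
      new RangePartition(i, i * n / numSlices, (i + 1) * n / numSlices): Partition
    }.toArray

  // Property 2: how to compute one split.
  override def compute(split: Partition, context: TaskContext): Iterator[Int] = {
    val p = split.asInstanceOf[RangePartition]
    Iterator.range(p.start, p.end)
  }
  // Properties 4 and 5 keep the defaults shown above: partitioner = None, no preferred locations.
}

Usage: new SimpleRangeRDD(sc, 10, 2).collect() returns the integers 0 through 9, computed by two tasks.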