RDD Source Code Analysis
I. RDD.scala
- Resilient Distributed Dataset (RDD)
Resilient: the resilience shows up in computation (a lost partition can be recomputed from the RDDs it depends on)
- the basic abstraction in Spark
- Represents an immutable
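(immutable like a Scala val: an RDD is never changed in place; transforming RDDA yields a new RDDB, i.e. RDDA ==> RDDB)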
- partitioned collection of elements
- that can be operated on in parallel
RDDA: (1,2,3,4,5,6,7,8,9), operation: +1 (add 1 to every element of the RDD)
hadoop000:Partition1: (1,2,3) +1
hadoop001:Partition2: (4,5,6) +1
hadoop002:Partition3: (7,8,9) +1
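Adding 1 to every element of the RDD runs on the three machines hadoop000, hadoop001, and hadoop002 at the same time.
Operating on an RDD means `operating on every partition of that RDD`.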
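The +1 example above can be modeled in plain Scala without Spark. This is only a toy sketch: `PartitionSketch`, `split`, and `mapPartitions` are hypothetical names invented for illustration, and the partitions are processed one after another here, whereas Spark would run one task per partition on the cluster.

```scala
object PartitionSketch {
  // Split the data into at most `numPartitions` contiguous chunks, loosely
  // mimicking how the elements of an RDD are spread across partitions.
  def split[T](data: Seq[T], numPartitions: Int): Seq[Seq[T]] = {
    require(numPartitions > 0, "numPartitions must be positive")
    val size = math.max(1, math.ceil(data.size.toDouble / numPartitions).toInt)
    data.grouped(size).toSeq
  }

  // Apply `f` to every element of every partition. Each partition is handled
  // independently of the others, which is what makes parallel execution possible.
  def mapPartitions[T, U](partitions: Seq[Seq[T]])(f: T => U): Seq[Seq[U]] =
    partitions.map(_.map(f))
}
```

With this sketch, `PartitionSketch.split(1 to 9, 3)` gives the three groups (1,2,3), (4,5,6), (7,8,9), and mapping `_ + 1` over them touches every partition.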
abstract class RDD[T: ClassTag](
@transient private var _sc: SparkContext,
@transient private var deps: Seq[Dependency[_]]
) extends Serializable with Logging {}
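Key points (taken from the declaration above):
1) Abstract class: RDD must have subclasses that implement it; in practice we use those subclasses directly
2) Serializable (serialization)
3) Logging (logging)
4) T (generic type parameter)
5) SparkContext (the entry point)
6) @transient (annotation: the marked field is skipped when the object is serialized)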
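The effect of @transient can be seen with plain Java serialization, independent of Spark. In the sketch below, `DriverHandle`, `Holder`, and `TransientSketch` are hypothetical stand-ins (not Spark classes): `DriverHandle` plays the role of a non-serializable driver-side object such as SparkContext, and `Holder` plays the role of a serializable class that references it.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Hypothetical stand-in for a driver-side object; deliberately NOT Serializable.
class DriverHandle(val name: String)

// Hypothetical stand-in for an RDD-like class: serializable, but its
// @transient field is skipped by Java serialization.
class Holder(@transient val handle: DriverHandle, val id: Int) extends java.io.Serializable

object TransientSketch {
  // Serialize a value to bytes and read it back, similar to what happens
  // when an object is shipped from one JVM to another.
  def roundTrip[A <: java.io.Serializable](value: A): A = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    out.writeObject(value)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray))
    try in.readObject().asInstanceOf[A] finally in.close()
  }
}
```

After `TransientSketch.roundTrip(new Holder(new DriverHandle("driver"), 7))`, the copy keeps `id` but its `handle` is null. Without the annotation, `writeObject` would fail with a NotSerializableException because `DriverHandle` is not serializable.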
II. JdbcRDD.scala
class JdbcRDD[T: ClassTag](
sc: SparkContext,
getConnection: () => Connection,
sql: String,
lowerBound: Long,
upperBound: Long,
numPartitions: Int,
mapRow: (ResultSet) => T = JdbcRDD.resultSetToObjectArray _)
extends RDD[T](sc, Nil) with Logging {
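The parameters lowerBound, upperBound, and numPartitions suggest that the key range is cut into one inclusive sub-range per partition, with each partition running the SQL for its own sub-range. The following is a minimal sketch of such a split, written from memory of the general idea rather than copied from the Spark source; `JdbcRangeSketch` and `ranges` are hypothetical names.

```scala
object JdbcRangeSketch {
  // Split the inclusive range [lowerBound, upperBound] into `numPartitions`
  // contiguous, inclusive sub-ranges of near-equal size.
  def ranges(lowerBound: Long, upperBound: Long, numPartitions: Int): Seq[(Long, Long)] = {
    require(numPartitions > 0, "numPartitions must be positive")
    require(lowerBound <= upperBound, "lowerBound must not exceed upperBound")
    // BigInt avoids overflow when the bounds are close to the limits of Long.
    val length = BigInt(upperBound) - BigInt(lowerBound) + 1
    (0 until numPartitions).map { i =>
      val start = BigInt(lowerBound) + (length * i) / numPartitions
      val end = BigInt(lowerBound) + (length * (i + 1)) / numPartitions - 1
      (start.toLong, end.toLong)
    }
  }
}
```

For example, `JdbcRangeSketch.ranges(1, 100, 3)` yields (1,33), (34,66), (67,100): every key is covered exactly once, so no row is read twice or missed.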
III. The five main properties of an RDD:
Internally, each RDD is characterized by five main properties:
(1, 2, and 3 are required; 4 and 5 are optional)
1) A list of partitions
2) A function for computing each split/partition
3) A list of dependencies on other RDDs
RDDA => RDDB => RDDC ==> RDDD
4) Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
5) Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)
preferred locations (plural, hence the "s": one RDD has many partitions, and each partition has its own preferred locations)
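To make "hash-partitioned" in property 4 concrete, here is a small sketch of assigning key-value pairs to partitions by key hash. It illustrates the general technique only; `HashPartitionSketch`, `partitionFor`, and `groupByPartition` are hypothetical names and this is not Spark's Partitioner API.

```scala
object HashPartitionSketch {
  // Map a key to a partition index in [0, numPartitions). The adjustment for
  // a negative remainder is needed because hashCode can be negative.
  def partitionFor(key: Any, numPartitions: Int): Int = {
    require(numPartitions > 0, "numPartitions must be positive")
    if (key == null) 0
    else {
      val raw = key.hashCode % numPartitions
      if (raw < 0) raw + numPartitions else raw
    }
  }

  // Group key-value pairs by the partition their key hashes to.
  def groupByPartition[K, V](pairs: Seq[(K, V)], numPartitions: Int): Map[Int, Seq[(K, V)]] =
    pairs.groupBy { case (k, _) => partitionFor(k, numPartitions) }
}
```

The point of the design is that equal keys always land in the same partition, so per-key operations can find all values for a key in one place.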
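A closer look at how RDD relates to its three keywords
Resilient, Distributed, Dataset
(Bucket principle: a bucket holds only as much as its shortest stave allows; likewise, the slowest task determines the performance of the whole computation)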
IV. How the five properties map to methods in the RDD source
1) protected def getPartitions: Array[Partition]
2) def compute(split: Partition, context: TaskContext): Iterator[T]
3) protected def getDependencies: Seq[Dependency[_]] = deps
4) @transient val partitioner: Option[Partitioner] = None
5) protected def getPreferredLocations(split: Partition): Seq[String] = Nil
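The mapping can be tried out with a toy model in plain Scala. Everything below is a hypothetical, simplified stand-in (`ToyPartition`, `ToyRDD`, `ToySeqRDD`, `ToyMappedRDD` are not Spark classes, and the TaskContext argument is left out); it only sketches how the three required members cooperate while the two optional ones keep their defaults.

```scala
// Hypothetical, simplified stand-ins; NOT Spark's classes.
final case class ToyPartition(index: Int)

abstract class ToyRDD[T](val deps: Seq[ToyRDD[_]]) {
  def getPartitions: Array[ToyPartition]                              // a list of partitions
  def compute(split: ToyPartition): Iterator[T]                       // a function for computing each split
  def getDependencies: Seq[ToyRDD[_]] = deps                          // a list of dependencies
  val partitioner: Option[Any => Int] = None                          // optional partitioner
  def getPreferredLocations(split: ToyPartition): Seq[String] = Nil   // optional preferred locations

  // Evaluate every partition and concatenate the results.
  def collect(): Seq[T] = getPartitions.toSeq.flatMap(p => compute(p).toSeq)
}

// Source RDD: holds its data in memory and has no parents.
class ToySeqRDD[T](chunks: Seq[Seq[T]]) extends ToyRDD[T](Nil) {
  def getPartitions: Array[ToyPartition] = chunks.indices.map(i => ToyPartition(i)).toArray
  def compute(split: ToyPartition): Iterator[T] = chunks(split.index).iterator
}

// Derived RDD: one parent, same partitions, computed by pulling from the parent.
class ToyMappedRDD[T, U](parent: ToyRDD[T], f: T => U) extends ToyRDD[U](Seq(parent)) {
  def getPartitions: Array[ToyPartition] = parent.getPartitions
  def compute(split: ToyPartition): Iterator[U] = parent.compute(split).map(f)
}
```

Because `ToyMappedRDD.compute` asks its parent for the same partition, any single partition can be rebuilt on its own by walking the dependency chain, which is the idea behind "resilient".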