RDD Source Code Analysis

I. RDD.scala

- Resilient Distributed Dataset (RDD)

    Resilient: the resilience lies in computation (for example, a lost partition can be recomputed from its parent RDDs)

- the basic abstraction in Spark
- Represents an immutable
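    like a Scala val: an RDD cannot be changed once it is created

    RDDA => RDDB (an operation on RDDA yields a new RDD, RDDB, and leaves RDDA as it was)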

- partitioned collection of elements
- that can be operated on in parallel 

RDDA: (1,2,3,4,5,6,7,8,9)               operation: +1 (add 1 to every element of the RDD)
    hadoop000:Partition1: (1,2,3)        +1
    hadoop001:Partition2: (4,5,6)        +1
    hadoop002:Partition3: (7,8,9)        +1

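Adding 1 to every element of the RDD runs at the same time on the three machines hadoop000, hadoop001, and hadoop002.
Operating on an RDD therefore means `operating on every partition of that RDD`.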
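
The per-partition "+1" described above can be modeled without Spark. What follows is a minimal plain-Scala sketch, not Spark's implementation: `ToyPartition` and `mapPartitions` are hypothetical names introduced here for illustration, and the partitions are processed one after another for simplicity rather than on three machines.

```scala
object PartitionedPlusOne {
  // Toy stand-in for an RDD partition: the host that holds it plus its elements.
  final case class ToyPartition(host: String, data: Vector[Int])

  // RDDA from the text: nine elements spread over three hosts.
  val rddA: Vector[ToyPartition] = Vector(
    ToyPartition("hadoop000", Vector(1, 2, 3)),
    ToyPartition("hadoop001", Vector(4, 5, 6)),
    ToyPartition("hadoop002", Vector(7, 8, 9))
  )

  // Apply f to the elements of every partition and return a NEW collection.
  // Each partition is handled independently of the others, which is what
  // would let a cluster run one task per partition at the same time.
  def mapPartitions(parts: Vector[ToyPartition])(f: Int => Int): Vector[ToyPartition] =
    parts.map(p => p.copy(data = p.data.map(f)))

  def main(args: Array[String]): Unit = {
    val rddB = mapPartitions(rddA)(_ + 1)
    rddB.foreach { p =>
      val elems = p.data.mkString(",")
      println(s"${p.host}: ($elems)")
    }
  }
}
```

`rddA` is never modified; the operation returns a new collection, which matches the "immutable" wording in the class comment.
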
abstract class RDD[T: ClassTag](
    @transient private var _sc: SparkContext,
    @transient private var deps: Seq[Dependency[_]]
  ) extends Serializable with Logging {}

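Key points (taken from the declaration above):
1) Abstract class: RDD must be implemented by subclasses, and in practice we simply use those subclasses
2) Serializable (serialization)
3) Logging (logging)
4) T (generic type parameter)
5) SparkContext (the entry point)
6) @transient (an annotation, not explored further here; in Scala it marks a field to be skipped during serialization)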
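
Points 2) and 6) interact, and a small experiment makes the effect visible. The sketch below assumes plain Java serialization. `Holder` and `roundTrip` are hypothetical names used only for illustration and are not part of Spark.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

object TransientSketch {
  // Hypothetical class: `name` is serialized, `driverOnly` is skipped
  // because it is marked @transient.
  class Holder(val name: String, @transient val driverOnly: String) extends Serializable

  // Serialize a value to bytes in memory and read it back.
  def roundTrip[A <: java.io.Serializable](value: A): A = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    try out.writeObject(value) finally out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
    try in.readObject().asInstanceOf[A] finally in.close()
  }

  def main(args: Array[String]): Unit = {
    val copy = roundTrip(new Holder("rdd", "only on the driver"))
    println(copy.name)               // the regular field survives
    println(copy.driverOnly == null) // the @transient field is not restored
  }
}
```

This is presumably why `_sc` and `deps` carry `@transient` in the RDD constructor: they are left out when an RDD object is serialized.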

II. JdbcRDD.scala

class JdbcRDD[T: ClassTag](
    sc: SparkContext,
    getConnection: () => Connection,
    sql: String,
    lowerBound: Long,
    upperBound: Long,
    numPartitions: Int,
    mapRow: (ResultSet) => T = JdbcRDD.resultSetToObjectArray _)
  extends RDD[T](sc, Nil) with Logging {
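
The `lowerBound`, `upperBound`, and `numPartitions` parameters suggest that the key range is divided among the partitions. The excerpt does not show how, so the following is only a sketch of one plausible scheme. It assumes both bounds are inclusive and that each partition covers one contiguous sub-range. `splitRange` is a hypothetical helper, not a method of `JdbcRDD`.

```scala
object JdbcRangeSketch {
  // Split the inclusive range [lowerBound, upperBound] into numPartitions
  // contiguous, inclusive sub-ranges of near-equal size.
  def splitRange(lowerBound: Long, upperBound: Long, numPartitions: Int): Vector[(Long, Long)] = {
    require(lowerBound <= upperBound, "lowerBound must not exceed upperBound")
    require(numPartitions > 0, "numPartitions must be positive")
    // BigInt avoids overflow when the range spans most of the Long domain.
    val length = BigInt(upperBound) - BigInt(lowerBound) + 1
    (0 until numPartitions).toVector.map { i =>
      val start = BigInt(lowerBound) + (length * i) / numPartitions
      val end = BigInt(lowerBound) + (length * (i + 1)) / numPartitions - 1
      (start.toLong, end.toLong)
    }
  }

  def main(args: Array[String]): Unit =
    splitRange(1L, 9L, 3).foreach { case (s, e) => println(s"$s to $e") }
}
```

If `numPartitions` exceeds the number of keys in the range, some sub-ranges in this sketch come out empty (their end is smaller than their start).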

III. The five main properties of an RDD:

Internally, each RDD is characterized by five main properties:  
        (1, 2, and 3 are required; 4 and 5 are optional)  
    1) A list of partitions
    2) A function for computing each split/partition
    3) A list of dependencies on other RDDs
            RDDA => RDDB => RDDC => RDDD
    4) Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
    5) Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)

   preferred locations (one RDD has multiple partitions, hence the plural "s")
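
Property 4 mentions hash partitioning. As a rough illustration, a key can be mapped to a partition index by taking its hash code modulo the number of partitions. This is a minimal sketch of that idea, not Spark's `Partitioner` code. `partitionFor` and `partitionPairs` are hypothetical names, and null keys are not handled.

```scala
object HashPartitionSketch {
  // Map a key to a partition index in [0, numPartitions) using its hashCode.
  // The adjustment keeps the index non-negative when hashCode is negative.
  def partitionFor(key: Any, numPartitions: Int): Int = {
    val raw = key.hashCode % numPartitions
    if (raw < 0) raw + numPartitions else raw
  }

  // Group key-value pairs by the partition their key hashes to.
  def partitionPairs[K, V](pairs: Seq[(K, V)], numPartitions: Int): Map[Int, Seq[(K, V)]] =
    pairs.groupBy { case (k, _) => partitionFor(k, numPartitions) }

  def main(args: Array[String]): Unit =
    println(partitionPairs(Seq(1 -> "a", 4 -> "b", 2 -> "c"), 3))
}
```

Equal keys always land in the same partition, which is the point of declaring a partitioner on a key-value RDD.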

   Understand in depth how an RDD relates to each keyword in its name:
   Resilient, Distributed, Dataset

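    (Barrel principle: a barrel holds only as much as its shortest stave allows, so the slowest task determines the overall computation performance)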
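
The barrel principle comes down to simple arithmetic: when tasks run in parallel, the elapsed time is the maximum of the task times, not their average. The durations below are made-up numbers for illustration, and `StragglerSketch` is a hypothetical name.

```scala
object StragglerSketch {
  // Hypothetical per-task durations in seconds, keyed by the host running the task.
  val taskSeconds: Map[String, Int] =
    Map("hadoop000" -> 2, "hadoop001" -> 3, "hadoop002" -> 9)

  // With all tasks running in parallel, the work is done only when the slowest task is.
  def stageSeconds(tasks: Map[String, Int]): Int = tasks.values.max

  // The host whose task holds everything up.
  def slowest(tasks: Map[String, Int]): String = tasks.maxBy(_._2)._1

  def main(args: Array[String]): Unit = {
    val total = stageSeconds(taskSeconds)
    val host = slowest(taskSeconds)
    println(s"elapsed: $total s, limited by $host")
  }
}
```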

IV. How the five properties map to methods in the RDD source

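  1) def compute(split: Partition, context: TaskContext): Iterator[T]
     (property 2: the function for computing each split/partition)
  
  2) protected def getPartitions: Array[Partition]
     (property 1: the list of partitions)
  
  3) protected def getDependencies: Seq[Dependency[_]] = deps
     (property 3: the list of dependencies on other RDDs)
  
  4) protected def getPreferredLocations(split: Partition): Seq[String] = Nil
     (property 5: the preferred locations for each split)
  
  5) @transient val partitioner: Option[Partitioner] = None
     (property 4: the optional Partitioner for key-value RDDs)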
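
To see how these five members fit together, here is a toy model in plain Scala that mirrors their shape. It is a sketch only and runs without Spark. `ToyRDD`, `ToyPartition`, `SliceRDD`, and `MappedRDD` are hypothetical classes, the `TaskContext` argument is omitted, and the partitioner is reduced to a plain function.

```scala
object FivePropertiesSketch {
  // Toy stand-in for Spark's Partition; only the index is kept.
  final case class ToyPartition(index: Int)

  abstract class ToyRDD[T](deps: Seq[ToyRDD[_]]) {
    // Property 2: compute the elements of one partition.
    def compute(split: ToyPartition): Iterator[T]
    // Property 1: the list of partitions.
    protected def getPartitions: Array[ToyPartition]
    // Property 3: parent RDDs; defaults to the constructor argument.
    protected def getDependencies: Seq[ToyRDD[_]] = deps
    // Property 5: preferred hosts for a partition; none by default.
    protected def getPreferredLocations(split: ToyPartition): Seq[String] = Nil
    // Property 4: optional partitioner, modeled as a key => index function.
    val partitioner: Option[Any => Int] = None

    // Public accessors over the protected hooks.
    final def partitions: Array[ToyPartition] = getPartitions
    final def dependencies: Seq[ToyRDD[_]] = getDependencies
    final def preferredLocations(split: ToyPartition): Seq[String] =
      getPreferredLocations(split)

    // Evaluate every partition in order (a real engine would run them as parallel tasks).
    def collect(): Vector[T] = partitions.toVector.flatMap(p => compute(p))
  }

  // Source RDD: no parents; data is already split into (host, elements) slices.
  class SliceRDD[T](slices: Vector[(String, Vector[T])]) extends ToyRDD[T](Nil) {
    def compute(split: ToyPartition): Iterator[T] = slices(split.index)._2.iterator
    protected def getPartitions: Array[ToyPartition] =
      slices.indices.map(i => ToyPartition(i)).toArray
    override protected def getPreferredLocations(split: ToyPartition): Seq[String] =
      Seq(slices(split.index)._1)
  }

  // Derived RDD: one parent, same partitions, applies f per partition.
  class MappedRDD[A, B](parent: ToyRDD[A], f: A => B) extends ToyRDD[B](Seq(parent)) {
    def compute(split: ToyPartition): Iterator[B] = parent.compute(split).map(f)
    protected def getPartitions: Array[ToyPartition] = parent.partitions
  }

  def main(args: Array[String]): Unit = {
    val rddA = new SliceRDD(Vector(
      "hadoop000" -> Vector(1, 2, 3),
      "hadoop001" -> Vector(4, 5, 6),
      "hadoop002" -> Vector(7, 8, 9)))
    val rddB = new MappedRDD[Int, Int](rddA, _ + 1)
    println(rddB.collect().mkString(","))
  }
}
```

In this model only `compute` and `getPartitions` are abstract, while the other three members have defaults. That is also the shape of the signatures listed above, where `getDependencies`, `getPreferredLocations`, and `partitioner` each come with a default value.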

posted @ 2019-05-07 22:18  BBBone