Spark - RDD (Resilient Distributed Dataset)


org.apache.spark.rdd
RDD
abstract class RDD[T] extends Serializable with Logging

A Resilient Distributed Dataset (RDD), the basic abstraction in Spark. Represents an immutable, partitioned collection of elements that can be operated on in parallel. This class contains the basic operations available on all RDDs, such as map, filter, and persist. In addition, org.apache.spark.rdd.PairRDDFunctions contains operations available only on RDDs of key-value pairs, such as groupByKey and join; org.apache.spark.rdd.DoubleRDDFunctions contains operations available only on RDDs of Doubles; and org.apache.spark.rdd.SequenceFileRDDFunctions contains operations available on RDDs that can be saved as Hadoop SequenceFiles. All operations are automatically available on any RDD of the right type (e.g. RDD[(Int, Int)]) through implicit conversions.
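A minimal sketch of these basic operations and the implicit pair-RDD conversion. This is not from the Scaladoc itself; the app name and local master are assumptions for illustration:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: a local SparkContext and a few of the basic operations above.
val conf = new SparkConf().setAppName("rdd-basics").setMaster("local[*]")
val sc = new SparkContext(conf)

val nums = sc.parallelize(1 to 10)        // RDD[Int]
val evens = nums.filter(_ % 2 == 0)       // basic operation on any RDD
val squares = evens.map(n => (n, n * n))  // RDD[(Int, Int)]

// Because squares is an RDD of pairs, operations from PairRDDFunctions
// such as groupByKey and join become available via implicit conversion.
val grouped = squares.groupByKey()

println(squares.collect().mkString(", "))
sc.stop()
```

Note that `filter` and `map` are lazy transformations; nothing runs on the cluster until an action such as `collect()` is called.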


Internally, each RDD is characterized by five main properties:

  • A list of partitions
  • A function for computing each split
  • A list of dependencies on other RDDs
  • Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
  • Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)


All of the scheduling and execution in Spark is done based on these methods, allowing each RDD to implement its own way of computing itself. Indeed, users can implement custom RDDs (e.g. for reading data from a new storage system) by overriding these functions. Please refer to the Spark paper for more details on RDD internals.
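As a hedged sketch of what "overriding these functions" means, here is a hypothetical custom RDD (not from the original post) that serves a range of integers. It implements the first two of the five properties directly; the empty dependency list covers the third:

```scala
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical partition type carrying the sub-range it owns.
class RangePartition(val index: Int, val start: Int, val end: Int)
  extends Partition

// Sketch of a custom RDD: Nil means no parent dependencies.
class SimpleRangeRDD(sc: SparkContext, n: Int, numSlices: Int)
  extends RDD[Int](sc, Nil) {

  // Property 1: a list of partitions.
  override protected def getPartitions: Array[Partition] = {
    val step = math.ceil(n.toDouble / numSlices).toInt
    (0 until numSlices).map { i =>
      new RangePartition(i, i * step, math.min((i + 1) * step, n))
    }.toArray
  }

  // Property 2: a function for computing each split.
  override def compute(split: Partition, context: TaskContext): Iterator[Int] = {
    val p = split.asInstanceOf[RangePartition]
    (p.start until p.end).iterator
  }
}
```

A real storage-backed RDD would additionally override `getPreferredLocations` (property 5) so the scheduler can place tasks near the data.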


Linear Supertypes
Logging, Serializable, Serializable, AnyRef, Any

Known Subclasses
CoGroupedRDD, EdgeRDD, EdgeRDDImpl, HadoopRDD, JdbcRDD, NewHadoopRDD, PartitionPruningRDD, ShuffledRDD, UnionRDD, VertexRDD, VertexRDDImpl

 

 

Summary

RDD is the core of Spark and the foundation of the whole Spark architecture. Its characteristics can be summarized as follows:

  • It is an immutable data structure
  • It is a distributed data structure that spans the cluster
  • It can be partitioned by the key of the data records
  • It provides coarse-grained operations, and these operations all support partitioning
  • It stores data in memory, providing low latency
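The in-memory point above maps to `persist`/`cache`. A short sketch, assuming an existing SparkContext `sc` and a hypothetical HDFS path:

```scala
import org.apache.spark.storage.StorageLevel

// Sketch: cache a derived RDD in memory so repeated actions reuse it
// instead of recomputing the whole lineage from the file each time.
val logs = sc.textFile("hdfs:///path/to/logs")   // hypothetical path
val errors = logs
  .filter(_.contains("ERROR"))
  .persist(StorageLevel.MEMORY_ONLY)

val total = errors.count()   // first action materializes the cache
val sample = errors.take(10) // subsequent actions reuse the cached partitions
```

`cache()` is shorthand for `persist(StorageLevel.MEMORY_ONLY)`; other storage levels trade memory for disk or serialization overhead.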

(To be continued)

Reposts please credit the original: http://www.cnblogs.com/suanec/p/4772707.html

posted @ 2015-08-31 11:44 澄轶