SparkContext and RDD - hard-working


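SparkContext.scala defines both a SparkContext class and a SparkContext object. SparkContext is the entry point to Spark: it connects to the Spark cluster and creates RDDs, accumulators, and broadcast variables.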

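Within the Spark framework, only one SparkContext should be active per JVM. Because the body of a Scala class is its primary constructor, the fields and code blocks defined in the SparkContext class are initialized and executed when the instance is created, and its methods become available on that instance.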

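(1) class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationClient. The primary constructor of SparkContext takes a single parameter of type SparkConf. SparkContext extends Logging and also mixes in the ExecutorAllocationClient trait; multiple traits are chained with the keyword with. A trait takes no class parameters, and method calls made through a trait (including super calls) are bound dynamically.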
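To illustrate the dynamic binding mentioned in (1), here is a minimal sketch using made-up traits (Greeter, Loud, and Polite are hypothetical names, not part of Spark). It shows that the meaning of super inside a trait depends on the order in which the traits are mixed in.

```scala
object MixinDemo {
  // Hypothetical base trait with a default implementation.
  trait Greeter {
    def greet(name: String): String = s"hello, $name"
  }

  // Stackable traits: super.greet is resolved at mix-in time, not here.
  trait Loud extends Greeter {
    override def greet(name: String): String = super.greet(name).toUpperCase
  }

  trait Polite extends Greeter {
    override def greet(name: String): String = super.greet(name) + ", welcome"
  }

  // Same traits, different order, different behavior.
  class LoudThenPolite extends Greeter with Loud with Polite
  class PoliteThenLoud extends Greeter with Polite with Loud
}
```

The rightmost trait is applied last, so LoudThenPolite upper-cases the greeting first and appends the suffix afterwards, while PoliteThenLoud upper-cases the whole string.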

(2) private val creationSite: CallSite = Utils.getCallSite()

val startTime = System.currentTimeMillis()

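1. Fields without private: a field declared with val gets only a public getter, whereas a field declared with var gets both a public getter and a public setter. For a field named x, the getter is named x and the setter is named x_=.

2. Fields with private: the getter and setter that would be generated for the val or var become private methods.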
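The accessor rules above can be sketched with a small class. Job and its fields are hypothetical names chosen for illustration only.

```scala
object AccessorDemo {
  class Job(val startTime: Long, var name: String) {
    // private var: both generated accessors are private to the class.
    private var attempts: Int = 0

    def retry(): Int = { attempts += 1; attempts }
  }

  def demo(): (Long, String) = {
    val job = new Job(100L, "first")
    job.name = "second"     // sugar for the generated setter
    job.name_=("third")     // calling the setter name_= explicitly
    // job.startTime = 1L   // would not compile: a val has no setter
    // job.attempts         // would not compile: the accessor is private
    (job.startTime, job.name)
  }
}
```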

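(3) private[spark] val stopped: AtomicBoolean = new AtomicBoolean(false)

private[X] widens private access to the enclosing scope X, where X must be an enclosing class, object, or package of the definition. Here X is the package spark, so stopped is accessible from code inside that package. Accessor methods are still generated with that visibility. private[this]: the member is visible only within the same object instance, that is, object-private, which is stricter than class-private.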
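A minimal sketch of the two qualifiers follows. Since a package declaration would make the snippet harder to run on its own, it uses an enclosing object (VisibilityDemo, a hypothetical name) as the qualifier instead of a package such as spark; the rule is the same.

```scala
object VisibilityDemo {
  class Counter {
    // Visible anywhere inside VisibilityDemo, but not outside it.
    private[VisibilityDemo] var hits: Int = 0

    // Object-private: even another Counter instance cannot read it.
    private[this] var secret: Int = 42

    def bump(): Unit = { hits += 1; secret += 1 }

    // other.secret would not compile here because secret is private[this].
    def sameHits(other: Counter): Boolean = hits == other.hits

    def revealed: Int = secret
  }

  // Allowed: peek lives inside the qualifying scope VisibilityDemo.
  def peek(c: Counter): Int = c.hits
}
```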

(4) private def assertNotStopped(): Unit. This method is a procedure because its return type is Unit, and it is a private method of the class.

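(5) def this() = this(new SparkConf()). This is an auxiliary constructor, not the primary one. The primary constructor of SparkContext is the class parameter list, which takes a SparkConf; this auxiliary constructor delegates to it with a fresh SparkConf.

The auxiliary constructor def this(config: SparkConf, preferredNodeLocationData: Map[String, Set[SplitInfo]]) must likewise begin by calling another constructor, here this(config).

(6) private[spark] def this(master: String, appName: String). An auxiliary constructor that is accessible only inside the spark package.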
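The constructor chain in (5) and (6) can be sketched as follows. Conf and Context are simplified stand-ins invented for this example, not the real SparkConf and SparkContext, and the enclosing object plays the role of the spark package.

```scala
object ConstructorDemo {
  // Hypothetical stand-in for SparkConf.
  final case class Conf(master: String = "local", appName: String = "demo")

  // The class parameter list is the primary constructor.
  class Context(val conf: Conf) {
    // Auxiliary constructor: its first action must be a call to another constructor.
    def this() = this(Conf())

    // Auxiliary constructor restricted to the enclosing scope.
    private[ConstructorDemo] def this(master: String, appName: String) =
      this(Conf(master, appName))
  }

  // Code inside ConstructorDemo may use the restricted constructor.
  def named(master: String, appName: String): Context = new Context(master, appName)
}
```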

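(7) @volatile private var _dagScheduler: DAGScheduler = _

private var _applicationId: String = _

The @volatile annotation tells the compiler that the annotated variable may be accessed by multiple threads. The initializer = _ gives each var the default value of its type (null for reference types) when the instance is constructed; the real value is assigned later during initialization.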
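As a small sketch of the pattern in (7), the class below declares fields with = _ and fills them in later. Holder and its members are hypothetical names; this shows only the default-value behavior, not Spark's actual initialization order.

```scala
object VolatileDemo {
  class Holder {
    // Default-initialized: null for a reference type, 0 for Int.
    @volatile private var _scheduler: String = _
    private var _count: Int = _

    def scheduler: String = _scheduler
    def count: Int = _count

    // The real values are assigned after construction.
    def init(name: String): Unit = {
      _scheduler = name
      _count += 1
    }
  }
}
```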

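(8) The try {} catch {} block: it contains various conditional statements, assigns the initial values of the fields, and uses master to create the task scheduler and related components.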

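(9) private[spark] def withScope[U](body: => U): U = RDDOperationScope.withScope[U](this)(body)

Here U is a type parameter, which can be a user-defined class or a built-in Scala type. body is a by-name parameter: a block of code (the operation) that is evaluated only when withScope runs it. SparkContext uses this function in many places.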
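The by-name parameter in (9) can be illustrated with a simplified wrapper. This is only a sketch: the extra name argument and the log buffer are assumptions made for the example and do not reflect what RDDOperationScope actually does.

```scala
object ScopeDemo {
  import scala.collection.mutable.ArrayBuffer

  // Hypothetical trace of what happened, for demonstration only.
  val log: ArrayBuffer[String] = ArrayBuffer.empty[String]

  // body is passed by name: it is evaluated inside the try, not at the call site.
  def withScope[U](name: String)(body: => U): U = {
    log += s"enter $name"
    try {
      body
    } finally {
      log += s"exit $name"
    }
  }

  def demo(): Int = withScope("map") {
    log += "body"
    21 * 2
  }
}
```

Because body is evaluated lazily, the wrapper can run code before and after it, which is the reason a by-name parameter is used for this kind of helper.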

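(10) def newAPIHadoopFile[K, V, F <: NewInputFormat[K, V]](
    path: String,
    fClass: Class[F],
    kClass: Class[K],
    vClass: Class[V],
    conf: Configuration = hadoopConfiguration): RDD[(K, V)]

Example call: newAPIHadoopFile[LongWritable, Text, TextInputFormat]("hdfs://ip:port/path/to/file"). A call that passes only the path resolves to the overload that derives the classes from the type parameters.

path: the file to read; conf: the Hadoop configuration; fClass: the InputFormat class of the input data; kClass: the key type of the input format; vClass: the value type of the input format.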

(11) def sequenceFile[K, V]
       (path: String, minPartitions: Int = defaultMinPartitions)
       (implicit km: ClassTag[K], vm: ClassTag[V],
        kcf: () => WritableConverter[K], vcf: () => WritableConverter[V]): RDD[(K, V)]
This function has a default parameter value (minPartitions), an implicit parameter list, and is written in curried form with two parameter lists.
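The same three features (default argument, second parameter list, implicit parameters) can be sketched without Spark. Converter and parsePairs are hypothetical names invented for this example and only loosely mirror the role of WritableConverter.

```scala
object CurryDemo {
  // Hypothetical type class playing the role of a converter.
  trait Converter[T] { def convert(raw: String): T }

  object Converter {
    // Implicit instances in the companion are found automatically.
    implicit val intConverter: Converter[Int] = new Converter[Int] {
      def convert(raw: String): Int = raw.trim.toInt
    }
    implicit val stringConverter: Converter[String] = new Converter[String] {
      def convert(raw: String): String = raw.trim
    }
  }

  // First list: ordinary parameters, one with a default.
  // Second list: implicit parameters supplied by the compiler.
  def parsePairs[K, V](lines: Seq[String], sep: String = ",")
                      (implicit kc: Converter[K], vc: Converter[V]): Seq[(K, V)] =
    lines.map { line =>
      val parts = line.split(sep, 2)
      (kc.convert(parts(0)), vc.convert(parts(1)))
    }
}
```

The caller normally writes only the first parameter list, for example CurryDemo.parsePairs[String, Int](Seq("spark, 6")), and the converters are resolved from the type arguments.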

(12) createTaskScheduler: creates the task scheduler.

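(13) def stop(): shuts down the SparkContext. object SparkMasterRegex: holds the regular expressions used to pattern-match the master string. class WritableFactory and object WritableFactory: contain implicit factory definitions, for example implicit def longWritableFactory: WritableFactory[Long].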
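Regex-based pattern matching of the kind SparkMasterRegex is used for can be sketched as below. The two patterns and the describe function are simplified assumptions written for this example; the real object defines its own set of patterns.

```scala
object MasterRegexDemo {
  // Illustrative patterns only; each capture group becomes a bound variable.
  val LocalN = """local\[([0-9]+|\*)\]""".r
  val SparkUrl = """spark://(.*)""".r

  def describe(master: String): String = master match {
    case "local"            => "local, 1 thread"
    case LocalN(threads)    => s"local, $threads threads"
    case SparkUrl(hostPort) => s"standalone at $hostPort"
    case other              => s"unknown master: $other"
  }
}
```

A Regex used as a pattern must match the whole input string, so "local[4]x" falls through to the last case.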

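RDD is an abstract class (abstract) that extends Serializable with Logging.
(1) Methods and fields marked final cannot be overridden.
(2) When a subclass of the abstract class overrides a method of the parent class, it must mark the method with override.
The abstract class RDD is extended by other RDD classes, such as HadoopRDD. Each subclass overrides the parent's methods to suit its own RDD operations,
such as sorting, map, and reduce.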
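The final and override rules can be sketched with a tiny class hierarchy. Dataset and RangeDataset are hypothetical names and are far simpler than RDD and HadoopRDD.

```scala
object RddLikeDemo {
  abstract class Dataset[T] {
    // Abstract: every subclass must provide an implementation.
    def compute(): Seq[T]

    // final: subclasses cannot override this method.
    final def count(): Int = compute().size

    // Concrete: overriding it requires the override keyword.
    def name: String = "dataset"
  }

  class RangeDataset(n: Int) extends Dataset[Int] {
    override def compute(): Seq[Int] = 0 until n
    override def name: String = s"range($n)"
    // override def count(): Int = 0   // would not compile: count is final
  }
}
```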

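By default, Map is an immutable collection: elements cannot be added or removed in place.

val person=Map("spark"->6,"Hadoop"->12)

A map defined this way cannot be grown or shrunk.

val person=scala.collection.mutable.Map("spark"->6,"Hadoop"->12)

With a mutable map, elements can be added, for example:

person+=("file"->5)

Elements can also be removed, for example:

person-="file"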
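Putting the map snippets together gives a small self-contained sketch. The names fixed, grown, and sizeAfterAdd are introduced here only for illustration.

```scala
object MapDemo {
  import scala.collection.mutable

  // Immutable map: + returns a new map and leaves the original unchanged.
  val fixed: Map[String, Int] = Map("spark" -> 6, "Hadoop" -> 12)
  val grown: Map[String, Int] = fixed + ("file" -> 5)

  // Mutable map: += and -= modify the map in place.
  val person: mutable.Map[String, Int] = mutable.Map("spark" -> 6, "Hadoop" -> 12)
  person += ("file" -> 5)
  val sizeAfterAdd: Int = person.size
  person -= "file"
}
```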

posted on 2016-01-17 01:33 hard-working
