satyrs

BlockManager

Summary: dropFromMemory (check info to see whether the block can be dropped; after the drop, report to the manager and clear the info). 1) Check TimeStampedHashMap[BlockId, BlockInfo] for a droppable block => get the StorageLevel from info. 2) StorageLeve...

posted @ 2017-10-08 02:37 satyrs Views (216) Comments (0) Recommendations (0)

basic spark or spark essentials-02(notes)

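Summary: What submitJob does. 1. The runJob that calls dagScheduler's runJob is the entry point, and it is a blocking operation: rdd.doCheckpoint() does not run until Spark has finished the job. It blocks on the waiter.awaitResult() call in step 3, i.e. submitJob returns a wai...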

posted @ 2017-10-07 19:35 satyrs Views (113) Comments (0) Recommendations (0)

by-name parameter & uses of _

Summary: A by-name parameter acts like a def. Scala has a solution to this problem called by-name parameters. By declaring a parameter as a: => A (note that th...
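
To illustrate the "acts like a def" remark, here is a minimal Scala sketch. The names ByNameDemo, twice and unless are hypothetical and do not come from the original post; they only show that a by-name argument is re-evaluated each time it is referenced and is not evaluated at all if it is never referenced.

```scala
object ByNameDemo {
  // `body: => A` is a by-name parameter: the argument expression is not
  // evaluated at the call site, but each time `body` is referenced here.
  def twice[A](body: => A): (A, A) = (body, body)

  // A small custom control structure: `body` only runs when `cond` is false.
  def unless(cond: Boolean)(body: => Unit): Unit =
    if (!cond) body
}
```

Calling ByNameDemo.twice with a block that has a side effect runs that side effect twice, whereas an ordinary by-value parameter would run it once before the call.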

posted @ 2017-10-07 16:17 satyrs Views (353) Comments (0) Recommendations (0)

exception via custom control

Summary: (The other reason (and the one more pertinent to Java developers) is that it provides a nice way to handle common exceptions. Why do I say nice? Firs...

posted @ 2017-10-07 15:55 satyrs Views (144) Comments (0) Recommendations (0)

Statistics

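Summary: Some loose, not very rigorous thoughts that look at the ideas behind the models as a whole and compare them. Maximum likelihood? It means that, a posteriori, the overall probability of a large set of samples occurring is maximized. The samples are independent, so the multiplication rule applies. Conditional probability is the probability of some event(s) occurring under some condition(s). A decision tree seeks its maximum, locally choosing whichever conditional probability is currently largest. The larger the conditional probability, the lower the uncertainty and the smaller the conditional entropy. The overall entropy does not necessarily decrease. Considering the whole...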

posted @ 2017-10-07 02:38 satyrs Views (133) Comments (0) Recommendations (0)

dependency & DF & DataSet & partitioner

Summary: Dependency. Narrow: OneToOne, Prune, Range. Wide: Shuffle. Inspecting dependencies: .dependencies, .toDebugString. DF Catalyst (SQL's query optimizer): reordering operations, reduc...

posted @ 2017-10-06 22:42 satyrs Views (635) Comments (0) Recommendations (0)

Reservoir Sampling

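Summary: If S is 1-10 and k=3, then R is initially 1,2,3. At i=4, pick j at random from 1-4: j=4 has probability 1/4, and j in 1-3 has probability 3/4. If j=3, assign 4 to R[j] -> 1,2,4; j=2 -> 1,4,3; j=1 -> 4,2,3; j=4 -> 1,2,3 (unchanged). Choosing 3 numbers at random from 1-4 gives exactly these four cases, and each case is guaranteed probability 1/4. The above is an example; the mathematical proof follows the same reasoning.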
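
The walk-through above matches the classic reservoir sampling procedure (commonly called Algorithm R). A minimal Scala sketch of it follows, assuming the input arrives as an Iterator; the names ReservoirSampling and sample are hypothetical and not taken from the original post. The code uses zero-based indices, so the slot j is drawn from 0 until i instead of 1 to i.

```scala
import scala.collection.mutable.ArrayBuffer
import scala.util.Random

object ReservoirSampling {
  // Keep a uniform random sample of up to k items from a stream of unknown length.
  def sample[T](stream: Iterator[T], k: Int, rng: Random = new Random()): Vector[T] = {
    val reservoir = ArrayBuffer.empty[T]
    var seen = 0 // number of items read so far
    for (item <- stream) {
      seen += 1
      if (seen <= k) {
        // The first k items fill the reservoir directly.
        reservoir += item
      } else {
        // Pick j uniformly from 0 until seen; the new item replaces slot j
        // only when j lands inside the reservoir, which has probability k / seen.
        val j = rng.nextInt(seen)
        if (j < k) reservoir(j) = item
      }
    }
    reservoir.toVector
  }
}
```

For the example in the summary, ReservoirSampling.sample((1 to 10).iterator, 3) returns three distinct values from 1-10; which three depends on the random draws.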

posted @ 2017-10-06 02:48 satyrs Views (84) Comments (0) Recommendations (0)

history server conf

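Summary: spark.history.updateInterval (default: 10): the interval, in seconds, at which log-related information is updated. spark.history.retainedApplications (default: 50): the number of application history records kept in memory; if this value is exceeded, older application information is removed, and when revisiting an already...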

posted @ 2017-10-06 02:24 satyrs Views (113) Comments (0) Recommendations (0)

optimization & error -02

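Summary: Shuffle disk I/O takes a long time: set spark.local.dir to multiple disks, choosing disks with fast I/O, to improve shuffle performance by increasing I/O capacity. A large number of map/reduce tasks produces many small shuffle files: set spark.shuffle.consolidateFiles to true to merge the shuffle intermediate...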

posted @ 2017-10-06 02:19 satyrs Views (111) Comments (0) Recommendations (0)

coalesce

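Summary: repartition(numPartitions: Int): RDD[T]; coalesce(numPartitions: Int, shuffle: Boolean = false): RDD[T]. Same: both re-divide the partitions of an RDD. Different: repartition is one case of coalesce, namely when the number of partitions increases, shuffle default...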

posted @ 2017-10-06 01:55 satyrs Views (606) Comments (0) Recommendations (0)