spark-source code - Post category - satyrs

BlockManager

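Summary: dropFromMemory (check info to see whether the block can be dropped; after the drop, report to the manager and clear the info). 1) Check whether TimeStampedHashMap[BlockId,BlockInfo] holds a block that can be dropped => get the StorageLevel from info. 2) StorageLeve...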

posted @ 2017-10-08 02:37 satyrs Views (230) Comments (0) Recommendations (0)

basic spark or spark essentials-02(notes)

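Summary: submitJob: what it does. 1. The runJob that wraps dagScheduler's runJob function is the entry point, and it is a blocking operation: rdd.doCheckpoint() does not execute until Spark has finished running the job. It blocks on the waiter.awaitResult() call in step 3, i.e. submitJob returns a wai...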
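The blocking behaviour described above can be sketched in plain Python. This is only an analogy, not Spark code: `submit_job`, `run_job`, and the `events` list are hypothetical names, a `Future` stands in for the waiter, and `Future.result()` plays the role of `waiter.awaitResult()`.

```python
from concurrent.futures import Future, ThreadPoolExecutor

events = []  # records the order in which things happen


def submit_job(executor: ThreadPoolExecutor, work) -> Future:
    """Hypothetical stand-in for submitJob: returns a waiter right away."""
    def task():
        result = work()
        events.append("job finished")
        return result
    return executor.submit(task)


def run_job(executor: ThreadPoolExecutor, work):
    """Hypothetical stand-in for runJob: blocks until the job is done."""
    waiter = submit_job(executor, work)
    result = waiter.result()      # blocks here, like waiter.awaitResult()
    events.append("checkpoint")   # runs only afterwards, like rdd.doCheckpoint()
    return result


with ThreadPoolExecutor(max_workers=1) as pool:
    total = run_job(pool, lambda: sum(range(10)))
```

Because `run_job` waits on the waiter before moving on, the checkpoint step can never be recorded ahead of the job's completion.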

posted @ 2017-10-07 19:35 satyrs Views (120) Comments (0) Recommendations (0)

dependency & DF & DataSet & partitioner

Summary: dependency. narrow: OneToOne, Prune, Range; wide: shuffle. Inspect dependencies with .dependencies and .toDebugString. DF: Catalyst (SQL's query optimizer): reordering operations, reduc...
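The narrow/wide distinction can be illustrated with a toy model in plain Python (not the Spark API). It assumes a dependency is given as a mapping from each parent partition index to the set of child partitions that read it; `classify_dependency` and the sample mappings are hypothetical names made up for this sketch.

```python
from typing import Dict, Set


def classify_dependency(parent_to_children: Dict[int, Set[int]]) -> str:
    """Return "narrow" if every parent partition feeds at most one
    child partition, otherwise "wide"."""
    for children in parent_to_children.values():
        if len(children) > 1:
            return "wide"
    return "narrow"


# One-to-one: parent partition i feeds child partition i.
one_to_one = {0: {0}, 1: {1}, 2: {2}}
# Range-style: parent partitions map onto an offset range of the child.
range_dep = {0: {3}, 1: {4}}
# Shuffle: every parent partition may feed every child partition.
shuffle_dep = {0: {0, 1}, 1: {0, 1}}

kinds = [classify_dependency(d) for d in (one_to_one, range_dep, shuffle_dep)]
```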

posted @ 2017-10-06 22:42 satyrs Views (651) Comments (0) Recommendations (0)

history server conf

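Summary: spark.history.updateInterval, default: 10. The interval, in seconds, at which log-related information is updated. spark.history.retainedApplications, default: 50. The number of application histories kept in memory; if this value is exceeded, older application information is removed, and when accessing again an already ...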
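As a rough illustration of how the two settings and their defaults fit together, here is a small plain-Python sketch that reads whitespace-separated `key value` lines and falls back to the defaults quoted in the summary. `load_history_conf` is a hypothetical helper, not part of Spark, and property names or defaults may differ in other Spark releases.

```python
# Defaults as quoted in the summary above.
DEFAULTS = {
    "spark.history.updateInterval": 10,         # seconds
    "spark.history.retainedApplications": 50,   # applications kept in memory
}


def load_history_conf(text: str) -> dict:
    """Parse 'key value' lines, keeping defaults for keys that are absent."""
    conf = dict(DEFAULTS)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[0] in DEFAULTS:
            conf[parts[0]] = int(parts[1])
    return conf


conf = load_history_conf("# keep more apps\nspark.history.retainedApplications 100\n")
```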

posted @ 2017-10-06 02:24 satyrs Views (122) Comments (0) Recommendations (0)

optimization & error -02

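Summary: Long shuffle disk I/O time: set spark.local.dir to multiple disks, choosing disks with fast I/O, to improve shuffle performance by increasing I/O capacity. A large number of map/reduce tasks produces many small shuffle files: set spark.shuffle.consolidateFiles to true to merge the shuffle intermediate...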

posted @ 2017-10-06 02:19 satyrs Views (118) Comments (0) Recommendations (0)

coalesce

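Summary: repartition(numPartitions: Int): RDD[T]; coalesce(numPartitions: Int, shuffle: Boolean = false): RDD[T]. Same: both re-divide the partitions of an RDD. Different: repartition is one case of coalesce, namely the one where the partition count increases, with shuffle by defa...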
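The relationship between the two calls can be sketched with a toy model in plain Python, where an RDD is just a list of partitions. This is not Spark code: the element placement here (merging adjacent partitions, round-robin redistribution) is a simplifying assumption, and real Spark chooses placements differently. What the sketch does preserve is that coalesce without a shuffle can only reduce the partition count, and that repartition behaves like coalesce with shuffle enabled.

```python
from typing import List

Partitions = List[List[int]]


def coalesce(parts: Partitions, n: int, shuffle: bool = False) -> Partitions:
    """Toy model of coalesce on a list-of-lists 'RDD'."""
    if shuffle:
        # With a shuffle, every element is redistributed (round-robin here).
        out: Partitions = [[] for _ in range(n)]
        i = 0
        for part in parts:
            for x in part:
                out[i % n].append(x)
                i += 1
        return out
    if n >= len(parts):
        # Without a shuffle the partition count cannot grow.
        return [list(p) for p in parts]
    # Without a shuffle, whole partitions are merged; no element is split off.
    out = [[] for _ in range(n)]
    for idx, part in enumerate(parts):
        out[idx * n // len(parts)].extend(part)
    return out


def repartition(parts: Partitions, n: int) -> Partitions:
    """Toy model of repartition: coalesce with shuffle switched on."""
    return coalesce(parts, n, shuffle=True)


parts = [[1, 2], [3], [4, 5, 6], [7]]
merged = coalesce(parts, 2)      # shrink without shuffle
unchanged = coalesce(parts, 8)   # asking for more has no effect
grown = repartition(parts, 8)    # growing requires a shuffle
```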

posted @ 2017-10-06 01:55 satyrs Views (623) Comments (0) Recommendations (0)

optimization & error -01

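Summary: Tuning always happens within the constraints of the scenario, and most choices are not absolute. Run tests to find the bottleneck (number of shuffle operations, number of RDD persistence operations, and GC). The work splits into development tuning, resource tuning, data-skew tuning, and shuffle tuning (covering code quality (APIs and data structures), parameters, data quality, patterns chosen with memory and network in mind (broadcast, serialization), and the official site's recommendations). RDD (...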

posted @ 2017-10-06 01:43 satyrs Views (184) Comments (0) Recommendations (0)

SequenceFile & SequenceFileInputFormat<K,V>

Summary: org.apache.hadoop.mapred.SequenceFileInputFormat<K,V> org.apache.hadoop.io.SequenceFile

posted @ 2017-10-01 02:46 satyrs Views (353) Comments (0) Recommendations (0)

schema inference(parsing)

Summary: So, how to infer? On the Java platform, use xsd-gen-0.2.0-jar-with-dependencies.jar or xbean-2.2.0.jar. input output can be used to get the response SOAP ...

posted @ 2017-09-28 18:02 satyrs Views (154) Comments (0) Recommendations (0)

Dataset.scala(sql)

Summary: 1 object Dataset, private to sql level. test & errors: text after ":" explains the source code; text after "//" is inserted analysis. 1 spark.read.textFile("...") textFile: org.apache.spark.sql.Dataset[Stri...

posted @ 2017-09-28 16:46 satyrs Views (252) Comments (0) Recommendations (0)

semi-structured data(notes)

Summary: data management: data model, schema. data model: collection of concepts for describing data. schema: using a model, a description of a particular collecti...

posted @ 2017-09-27 00:41 satyrs Views (251) Comments (0) Recommendations (0)

build jar(sbt)

Summary: encountered in a project. example: .sbt .sh

posted @ 2017-09-26 23:14 satyrs Views (149) Comments (0) Recommendations (0)

basic spark or spark essentials-01(notes)

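Summary: parallelized, lazily transform, cache(), actions. Operators: an operator is a function defined on an RDD that can transform and operate on the data in the RDD. Data is converted into data blocks in Spark; an RDD is a set of partitions, physically a metadata structure that stores the mapping, with a block as its physical counterpart. Through BlockManager ...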

posted @ 2017-09-26 23:00 satyrs Views (154) Comments (0) Recommendations (0)

Spark+Kafka(project)

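Summary: Repository: https://github.com/yuqingwang15/kafka-spark. The example counts, in real time, the number of male and female shoppers each second, so for each shopping log record we only need to extract gender and send it to Kafka; Spark Streaming then receives gender and processes it. 1. The application sends the shopping logs to ...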
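The per-second counting logic can be sketched in plain Python, independent of Kafka and Spark Streaming. The record shape `(second, gender)` and the `"F"`/`"M"` labels are assumptions made for illustration, and `count_gender_per_second` is a hypothetical helper rather than code from the linked repository.

```python
from collections import Counter, defaultdict


def count_gender_per_second(records):
    """records: iterable of (second, gender) pairs.
    Returns {second: Counter mapping gender -> number of shoppers}."""
    counts = defaultdict(Counter)
    for second, gender in records:
        counts[second][gender] += 1
    return dict(counts)


# Only the gender field of each shopping log record is kept.
logs = [(0, "F"), (0, "M"), (0, "F"), (1, "M")]
per_second = count_gender_per_second(logs)
```

In a streaming setting the same grouping would be applied to each one-second batch as it arrives, rather than to a list held in memory.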

posted @ 2017-09-26 14:33 satyrs Views (594) Comments (0) Recommendations (0)

build jar(intellij)

Summary: File->Project Structure; Artifacts->green plus sign->Jar->From modules with dependencies...; Main Class->Search by Name->Apply->OK. Delete all the other entries, keeping only Name.jar and Name compil...

posted @ 2017-09-26 10:21 satyrs Views (249) Comments (0) Recommendations (0)

RDD (google rdd paper notes)

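Summary: RDD, Twister, HaLoop, Dryad, MR, Pregel.... Reuse of intermediate results across multiple parallel operations; the abstraction provides automatic fault tolerance, locality-aware scheduling, and scalability. Fault tolerance: data checkpointing and logging updates to the data. RDDs support only coarse-grained transformations, i.e. a single operation applied to a large number of records. The series of transformations that created the RDD is recorded (the lineage) so that lost ...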
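Lineage-based recovery can be illustrated with a toy class in plain Python. `ToyRDD` and its methods are hypothetical names for this sketch, not Spark's implementation: it records only `map`-style coarse-grained transformations and recomputes a partition from the source data when its computed result has been lost.

```python
class ToyRDD:
    """Toy model: partitions are recomputed from source data plus lineage."""

    def __init__(self, source, lineage=()):
        self.source = source            # partition index -> input records
        self.lineage = tuple(lineage)   # recorded transformations, in order
        self.cache = {}                 # partition index -> computed records

    def map(self, fn):
        # A coarse-grained transformation: one function for all records.
        return ToyRDD(self.source, self.lineage + (fn,))

    def compute(self, idx):
        if idx not in self.cache:
            records = self.source[idx]
            for fn in self.lineage:     # replay the lineage
                records = [fn(r) for r in records]
            self.cache[idx] = records
        return self.cache[idx]

    def lose(self, idx):
        # Simulate losing the computed partition (for example, a node failure).
        self.cache.pop(idx, None)


source = {0: [1, 2], 1: [3, 4]}
rdd = ToyRDD(source).map(lambda x: x * 10).map(lambda x: x + 1)
before = rdd.compute(1)
rdd.lose(1)
after = rdd.compute(1)   # rebuilt by replaying the lineage
```

Only the lost partition is recomputed; partition 0 is never touched by the recovery.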

posted @ 2017-09-23 22:46 satyrs Views (186) Comments (0) Recommendations (0)

spark 01

Summary: debug environment http://spark.apache.org/docs/latest/building-spark.html 1. spark-shell → spark-submit → (SparkSubmit) spark-class 2. open jvm → thread dump → ma...

posted @ 2017-09-20 05:03 satyrs Views (139) Comments (0) Recommendations (0)
