Spark RDD Programming (Blog Index, Updated Regularly)

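This post records the papers and blog posts I consulted while solving RDD programming performance problems. It indexes the ones I found well written so they are easy to look up later; my own summary will go in a separate post.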


1) Improving Spark performance through partitioning
https://blog.csdn.net/qq_32649581/article/details/83029852

 

2) Differences between repartition, partitionBy, and coalesce on DataFrames

 https://blog.csdn.net/u010720408/article/details/90229461

 

3) Spark core components: the Partitioner
https://www.jianshu.com/p/67fff2e477fa

 

4) What cache and persist do in Spark, and the available storage levels

https://blog.csdn.net/qq_20641565/article/details/76216417

 

5) Causes of data skew and how to fix them
https://blog.csdn.net/qq_38247150/article/details/80366769

https://www.cnblogs.com/qiuhong10/p/7762532.html

 

6) The reservoir sampling problem
Theoretical background: https://www.cnblogs.com/strugglion/p/6424874.html
Use in RangePartitioner: https://blog.csdn.net/u011564172/article/details/54380574

 

Spark Exception Handling

1) Spark exception handling: Shuffle FetchFailedException

https://www.jianshu.com/p/23182ea3892d

posted @ 2019-10-21 11:39  梦里繁花