XGogo - 博客园

July 30, 2016

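Summary: PairRDD has a few tricky operators that I tend to understand and then forget again, so I am writing them down according to my own understanding for future reference. 1. aggregateByKey: "aggregate" means to combine, so the intuitive reading is aggregation by key. Transformation: RDD[(K,V)] ==> RDD[(K,U)]. As this shows, the type of the returned values does not need to match that of the original RDD's …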
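The RDD[(K,V)] ==> RDD[(K,U)] transformation mentioned above can be illustrated with a small plain-Java model. This is only a sketch of the semantics, not the Spark API: each inner list stands in for one partition, and the names AggregateByKeySketch, pair, and demo are hypothetical. It assumes the usual three ingredients of aggregateByKey: a zero value, a function that folds one value into the accumulator within a partition, and a function that merges accumulators across partitions.

```java
import java.util.AbstractMap;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;
import java.util.function.BinaryOperator;
import java.util.function.Supplier;

// Plain-Java model of aggregateByKey semantics; this is NOT the Spark API.
final class AggregateByKeySketch {

    // Hypothetical helper: builds one (key, value) pair.
    static <K, V> Map.Entry<K, V> pair(K key, V value) {
        return new AbstractMap.SimpleEntry<>(key, value);
    }

    // Each inner list stands in for one RDD partition (an assumption of this sketch).
    static <K, V, U> Map<K, U> aggregateByKey(
            List<List<Map.Entry<K, V>>> partitions,
            Supplier<U> zeroValue,
            BiFunction<U, V, U> seqOp,
            BinaryOperator<U> combOp) {
        Map<K, U> merged = new HashMap<>();
        for (List<Map.Entry<K, V>> partition : partitions) {
            // Step 1: fold the values of each key inside one partition, starting from zeroValue.
            Map<K, U> local = new HashMap<>();
            for (Map.Entry<K, V> e : partition) {
                U acc = local.containsKey(e.getKey()) ? local.get(e.getKey()) : zeroValue.get();
                local.put(e.getKey(), seqOp.apply(acc, e.getValue()));
            }
            // Step 2: merge the per-partition accumulators that share a key.
            for (Map.Entry<K, U> e : local.entrySet()) {
                merged.merge(e.getKey(), e.getValue(), combOp);
            }
        }
        return merged;
    }

    // V is Integer, U is int[]{sum, count}: the result type differs from the input type.
    static Map<String, int[]> demo() {
        List<Map.Entry<String, Integer>> p0 = Arrays.asList(pair("a", 1), pair("b", 2));
        List<Map.Entry<String, Integer>> p1 = Arrays.asList(pair("a", 3), pair("a", 5));
        List<List<Map.Entry<String, Integer>>> partitions = Arrays.asList(p0, p1);
        return AggregateByKeySketch.<String, Integer, int[]>aggregateByKey(
                partitions,
                () -> new int[] {0, 0},
                (acc, v) -> new int[] {acc[0] + v, acc[1] + 1},
                (x, y) -> new int[] {x[0] + y[0], x[1] + y[1]});
    }

    public static void main(String[] args) {
        demo().forEach((k, v) -> System.out.println(k + " -> sum=" + v[0] + ", count=" + v[1]));
    }
}
```

In demo(), the input values are Integer while the result per key is an int[] holding a sum and a count, which is the point of the (K,V) to (K,U) signature: the accumulator type is free to differ from the value type.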

posted @ 2016-07-30 21:08 XGogo Views(693) Comments(0) Recommended(0)

July 26, 2016

How-to: Tune Your Apache Spark Jobs (Part 1)

Summary: Learn techniques for tuning your Apache Spark jobs for optimal efficiency. When you write Apache Spark code and page through the public APIs, you come …

posted @ 2016-07-26 22:23 XGogo Views(219) Comments(0) Recommended(0)

Along with all the above benefits, you cannot overlook the space efficiency and performance gains in using DataFrames and Dataset APIs for two reasons.

Summary: Of all the developers’ delight, a set of APIs that makes them productive, that are easy to use, and that are intuitive and expressive is the most attr…

posted @ 2016-07-26 22:22 XGogo Views(235) Comments(0) Recommended(0)

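July 23, 2016

"SPARK/TACHYON: A Memory-Based Distributed Storage System" - Shi Mingfei (Engineer, Big Data Software Department, Intel Asia-Pacific Research and Development Ltd.)

Summary: Shi Mingfei: Hello everyone, my name is Shi Mingfei and I am from Intel. Next I will introduce Tachyon. First I would like to know whether you have heard of Tachyon or know anything about it. What about Spark? To begin with an introduction: I am from Intel's big data team, which is mainly dedicated to developing various kinds of big data software and to promoting that software in industry and …

posted @ 2016-07-23 23:15 XGogo Views(455) Comments(0) Recommended(0)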

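July 18, 2016

The Road to Spark Mastery: Broadcast Variable (Broadcast) Source Code Analysis

Summary: Overview. Work has been extremely busy lately... I actually read through the broadcast variable code a long time ago but never posted my notes. This article is based on the Spark 1.0 source code and mainly covers the initialization, creation, reading, and cleanup of broadcast variables. Class relationships: the BroadcastManager class holds a reference to a BroadcastFactory object. Most operations are carried out by calling BroadcastFacto…

posted @ 2016-07-18 17:26 XGogo Views(664) Comments(0) Recommended(0)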

Redis on Spark: Task not serializable

Summary: We use Redis on Spark to cache our key-value pairs. This is the code: import com.redis.RedisClient val r = new RedisClient("192.168.1.101", 6379) val p …

posted @ 2016-07-18 15:22 XGogo Views(821) Comments(0) Recommended(0)

July 17, 2016

A Case Study in Tuning Spark Application Parameters

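Summary: Parallelism. For RDDs produced by a shuffle, such as those created by the *ByKey operators, the number of partitions is determined in the following order of precedence: 1. the method's second argument > 2. the spark.default.parallelism setting > 3. the partition count of whichever parent RDD has the most partitions. For other RDDs, …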
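The precedence rule above can be written out as a short Java function. This is a sketch of the rule as the summary states it, not Spark's actual source code; the class name, method name, and parameters are hypothetical, and an absent explicit argument or absent spark.default.parallelism setting is modeled as an empty OptionalInt.

```java
import java.util.OptionalInt;

// Sketch of the precedence rule from the summary; not Spark's own implementation.
final class PartitionCountSketch {

    // explicitArg: the partition count passed to the *ByKey method, if any.
    // defaultParallelism: the value of spark.default.parallelism, if it is set.
    // parentPartitionCounts: partition counts of all RDDs the new RDD depends on.
    static int resolvePartitions(OptionalInt explicitArg,
                                 OptionalInt defaultParallelism,
                                 int... parentPartitionCounts) {
        if (explicitArg.isPresent()) {
            return explicitArg.getAsInt();          // 1. the method argument wins
        }
        if (defaultParallelism.isPresent()) {
            return defaultParallelism.getAsInt();   // 2. then the configured default
        }
        int max = 0;
        for (int count : parentPartitionCounts) {   // 3. otherwise the largest parent
            max = Math.max(max, count);
        }
        return max;
    }
}
```

For example, with parents of 4 and 8 partitions, no explicit argument, and no configured default, the function returns 8; setting the default to 200 changes the result to 200.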

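posted @ 2016-07-17 18:13 XGogo Views(2611) Comments(0) Recommended(0)

Spark Performance Tuning (2): Broadcast Variables, Local Cache Directories, RDD Operations, Data Skew

Summary: Reposted from: http://blog.cheyo.net/104.html Broadcast variables. Background: when a task is larger than about 10K (the official Spark recommendation is 20K), consider using broadcast variables as an optimization. When joining a large table with a small table, broadcast the small table to cut down on join operations. See: Spark Broadcast Variables and Accumulators. Local Dir. Background: shuf…

posted @ 2016-07-17 18:03 XGogo Views(335) Comments(0) Recommended(0)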

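Spark Performance Tuning (1): Serialization, Memory, Parallelism, Data Storage Formats, Shuffle

Summary: Serialization. Background: data must be serialized during the following processes: Optimization points: Spark's default serializer is Java serialization. Its advantage is good compatibility, with no need to register classes yourself; its drawback is poor performance. To improve performance, it is recommended to replace the default Java serialization with Kryo serialization. Kryo's advantages are speed and compact output; its drawbacks are poorer compatibility and the need to register …

posted @ 2016-07-17 18:01 XGogo Views(995) Comments(0) Recommended(0)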

Java – Convert IP address to Decimal Number

Summary: In this tutorial, we show you how to convert an IP address to its decimal equivalent in Java, and vice versa. For example: 1. IP Address t…
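The two conversions the tutorial describes can be sketched as follows. This is a minimal illustration and not the tutorial's own code, which is cut off above; the class and method names are hypothetical, and only dotted-quad IPv4 addresses are assumed.

```java
// Minimal IPv4 <-> decimal conversion sketch; names are hypothetical.
final class IpConverterSketch {

    // Treats the four octets as base-256 digits, most significant first.
    static long ipToLong(String ip) {
        String[] parts = ip.split("\\.");
        if (parts.length != 4) {
            throw new IllegalArgumentException("Expected four octets: " + ip);
        }
        long result = 0;
        for (String part : parts) {
            int octet = Integer.parseInt(part);
            if (octet < 0 || octet > 255) {
                throw new IllegalArgumentException("Octet out of range: " + part);
            }
            // Shift the accumulated value one byte left and append the new octet.
            result = (result << 8) | octet;
        }
        return result;
    }

    // Reverses ipToLong by extracting each byte with a shift and a mask.
    static String longToIp(long value) {
        if (value < 0 || value > 0xFFFFFFFFL) {
            throw new IllegalArgumentException("Not a 32-bit value: " + value);
        }
        return ((value >> 24) & 0xFF) + "."
                + ((value >> 16) & 0xFF) + "."
                + ((value >> 8) & 0xFF) + "."
                + (value & 0xFF);
    }

    public static void main(String[] args) {
        long decimal = ipToLong("192.168.1.2");
        System.out.println(decimal);            // prints 3232235778
        System.out.println(longToIp(decimal));  // prints 192.168.1.2
    }
}
```

A long is used rather than an int because addresses at or above 128.0.0.0 do not fit in a signed 32-bit value.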

posted @ 2016-07-17 17:50 XGogo Views(849) Comments(0) Recommended(0)
