原创文章,转载请注明: 转载自http://www.cnblogs.com/tovin/p/3833985.html
最近在使用spark开发过程中发现当数据量很大时,如果cache数据将消耗很多的内存。为了减少内存的消耗,测试了一下 Kryo serialization的使用
代码包含三个类,KryoTest、MyRegistrator、Qualify。
我们知道在Spark默认使用的是Java自带的序列化机制。如果想使用Kryo serialization,只需要添加KryoTest类中的红色部分,指定spark序列化类
另外还需要增加MyRegistrator类,注册需要用Kryo序列化的类
1 public class KryoTest { 2 public static void main(String[] args) { 3 SparkConf conf = new SparkConf(); 4 conf.setMaster("local"); 5 conf.setAppName("KryoTest"); 6 conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer"); 7 conf.set("spark.kryo.registrator", "MyRegistrator"); 8 9 JavaSparkContext sc = new JavaSparkContext(conf); 10 11 JavaRDD<String> rdd = sc.textFile("/home/hdpusr/qualifying.txt"); 12 JavaRDD<Qualify> map = rdd.map(new Function<String, Qualify>() { 13 /* (non-Javadoc) 14 * @see org.apache.spark.api.java.function.Function#call(java.lang.Object) 15 */ 16 public Qualify call(String v1) throws Exception { 17 // TODO Auto-generated method stub 18 String s[] = v1.split(","); 19 Qualify q = new Qualify(); 20 q.setA(Integer.parseInt(s[0])); 21 q.setB(Long.parseLong(s[1])); 22 q.setC(s[2]); 23 24 25 return q; 26 } 27 }); 28 map.persist(StorageLevel.MEMORY_AND_DISK_SER()); 29 System.out.println(map.count()); 30 } 31 }
1 import org.apache.spark.serializer.KryoRegistrator; 2 3 import com.esotericsoftware.kryo.Kryo; 4 5 public class MyRegistrator implements KryoRegistrator{ 6 /* (non-Javadoc) 7 * @see org.apache.spark.serializer.KryoRegistrator#registerClasses(com.esotericsoftware.kryo.Kryo) 8 */ 9 public void registerClasses(Kryo arg0) { 10 // TODO Auto-generated method stub 11 arg0.register(Qualify.class); 12 } 13 }
1 import java.io.Serializable; 2 3 4 public class Qualify implements Serializable{ 5 int a; 6 long b; 7 String c; 8 public int getA() { 9 return a; 10 } 11 public void setA(int a) { 12 this.a = a; 13 } 14 public long getB() { 15 return b; 16 } 17 public void setB(long b) { 18 this.b = b; 19 } 20 public String getC() { 21 return c; 22 } 23 public void setC(String c) { 24 this.c = c; 25 } 26 27 }
下面我们看看使用Java serialization 与Kryo serialization的效果对比
Java serialization
Kryo serialization
从实际跑的数据可以看出还是能节省不少内存的。当内存不够用的时候建议使用Kryo serialization这种方式
原创文章,转载请注明: 转载自http://www.cnblogs.com/tovin/p/3833985.html