Comparison of Thrift, Protobuf, and Avro serialization
This post compares the size of the serialized data when Thrift serializes with the TCompactProtocol protocol, Protobuf serializes with its standard binary wire format, and Avro serializes with the AvroKeyOutputFormat format.
Because Thrift's binary data type cannot itself be serialized back into binary, the test schema contains no binary-typed fields.
1. Avro schema
The Avro schema for the test data is defined as follows:
{ "namespace": "com.linkedin.haivvreo", "name": "test_serializer", "type": "record", "fields": [ { "name":"string1", "type":"string" }, { "name":"int1", "type":"int" }, { "name":"tinyint1", "type":"int" }, { "name":"smallint1", "type":"int" }, { "name":"bigint1", "type":"long" }, { "name":"boolean1", "type":"boolean" }, { "name":"float1", "type":"float" }, { "name":"double1", "type":"double" }, { "name":"list1", "type":{"type":"array", "items":"string"} }, { "name":"map1", "type":{"type":"map", "values":"int"} }, { "name":"struct1", "type":{"type":"record", "name":"struct1_name", "fields": [ { "name":"sInt", "type":"int" }, { "name":"sBoolean", "type":"boolean" }, { "name":"sString", "type":"string" } ] } }, { "name":"enum1", "type":{"type":"enum", "name":"enum1_values", "symbols":["BLUE","RED", "GREEN"]} }, { "name":"nullableint", "type":["int", "null"] } ] }
2. Thrift schema
The Thrift schema for the test data is defined as follows:
```thrift
namespace java com.linkedin.haivvreo

struct struct1_name {
  1: required i32 sInt;
  2: required bool sBoolean;
  3: required string sString;
}

enum enum1_values {
  BLUE,
  RED,
  GREEN
}

struct union1 {
  1: optional double member0;
  2: optional bool member1;
  3: optional string member2;
}

struct test_serializer {
  1: required string string1;
  2: required i32 int1;
  3: required i32 tinyint1;
  4: required i32 smallint1;
  5: required i64 bigint1;
  6: required bool boolean1;
  7: required double float1;
  8: required double double1;
  9: required list<string> list1;
  10: required map<string, i32> map1;
  11: required struct1_name struct1;
  12: required string enum1;
  13: optional i32 nullableint;
}
```
3. Protobuf schema
syntax = "proto3"; package com.linkedin.haivvreo; message Struct1Name { int32 sInt = 1; bool sBoolean = 2; string sString = 3; } enum Enum1Values { BLUE = 0; //proto3版本中,首成员必须为0,成员不应有相同的值 RED = 1; GREEN = 2; } message TestSerializer { string string1 = 1; int32 int1 = 2; int32 tinyint1 = 3; int32 smallint1 = 4; int64 bigint1 = 5; bool boolean1 = 6; double float1 = 7; double double1 = 8; repeated string list1 = 9; map<string, int32> map1 = 10; Struct1Name struct1 = 11; Enum1Values enum1 = 12; int32 nullableint = 13; }
Compile the Protobuf schema:

```bash
protoc -I=./ --java_out=src/main/java/ ./src/main/proto3/test_serializer.proto
```
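For reference, a minimal sketch of building and serializing one of the generated messages from Scala. The outer class name `TestSerializerOuterClass` is an assumption; protoc derives it from the `.proto` file name and appends `OuterClass` when it would collide with a message name:

```scala
// Assumption: protoc generated an outer class named TestSerializerOuterClass
import com.linkedin.haivvreo.TestSerializerOuterClass.{Enum1Values, Struct1Name, TestSerializer}

val struct = Struct1Name.newBuilder()
  .setSInt(42)
  .setSBoolean(true)
  .setSString("hello")
  .build()

val msg = TestSerializer.newBuilder()
  .setString1("abcdefghij")
  .setInt1(12345)
  .setTinyint1(12)
  .setSmallint1(3)
  .setBigint1(123456789L)
  .setBoolean1(true)
  .setFloat1(0.5)
  .setDouble1(0.25)
  .addList1("a").addList1("b")     // repeated field
  .putMap1("a", 1).putMap1("b", 2) // map field
  .setStruct1(struct)
  .setEnum1(Enum1Values.BLUE)
  .setNullableint(99)
  .build()

// toByteArray produces the Protobuf wire-format bytes whose size is compared below
val bytes: Array[Byte] = msg.toByteArray
println(bytes.length)
```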
4. Test procedure
The test data is generated by randomly constructing a Thrift object in code:
```scala
import scala.collection.JavaConverters._
import scala.collection.immutable.HashMap
import org.apache.commons.lang3.RandomStringUtils
import com.linkedin.haivvreo.{enum1_values, struct1_name, test_serializer}

val obj = new test_serializer()
obj.setString1(RandomStringUtils.randomAlphanumeric(10))
obj.setInt1(new java.util.Random().nextInt(100000))
obj.setTinyint1(new java.util.Random().nextInt(100))
obj.setSmallint1(new java.util.Random().nextInt(10))
obj.setBigint1(new java.util.Random().nextLong())
obj.setBoolean1(new java.util.Random().nextBoolean())
obj.setFloat1(new java.util.Random().nextFloat())
obj.setDouble1(new java.util.Random().nextDouble())
val cs1 = RandomStringUtils.randomAlphanumeric(10): CharSequence
val cs2 = RandomStringUtils.randomAlphanumeric(10): CharSequence
obj.setList1(List(cs1, cs2).asJava)
val map: java.util.Map[CharSequence, Integer] =
  HashMap(cs1 -> new java.util.Random().nextInt(10000), cs2 -> new java.util.Random().nextInt(10000))
    .map(line => (line._1, Integer.valueOf(line._2))).asJava
obj.setMap1(map)
val struct1 = new struct1_name
struct1.setSInt(new java.util.Random().nextInt(1000000))
struct1.setSBoolean(new java.util.Random().nextBoolean())
struct1.setSString(RandomStringUtils.randomAlphanumeric(10))
obj.setStruct1(struct1)
val enum1 = enum1_values.BLUE
obj.setEnum1(enum1.toString) // enum1 is declared as a string in the Thrift schema
obj.setNullableint(new java.util.Random().nextInt(10000))
obj
```
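The SequenceFile rows in the results table measure the bytes produced by serializing such objects. A minimal sketch of that step with Thrift's `TSerializer`, assuming the generated `test_serializer` class above:

```scala
import org.apache.thrift.TSerializer
import org.apache.thrift.protocol.{TBinaryProtocol, TCompactProtocol}

// TSerializer wraps a protocol factory and turns a Thrift object into bytes
val compactSerializer = new TSerializer(new TCompactProtocol.Factory())
val compactBytes: Array[Byte] = compactSerializer.serialize(obj)

// The same object with TBinaryProtocol is noticeably larger (see the results table)
val binarySerializer = new TSerializer(new TBinaryProtocol.Factory())
val binaryBytes: Array[Byte] = binarySerializer.serialize(obj)
```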
An Avro object can be built with the Avro Java API from a schema string parsed in code:
```scala
import org.apache.avro.Schema
import org.apache.avro.generic.GenericData

val rdd = sc.parallelize(Seq(1, 1, 1, 1, 1))
val rdd2 = rdd.map { line =>
  // schemasStr holds the Avro schema JSON string
  val avroSchema = new Schema.Parser().parse(schemasStr)
  val avroRecord = new GenericData.Record(avroSchema)
  avroRecord.put("firstname", "hello")
  avroRecord.put("lastname", "world")
  avroRecord.put("age", 20)
  val childSchema = avroRecord.getSchema.getField("address").schema
  val childRecord = new GenericData.Record(childSchema)
  childRecord.put("streetaddress", "haidian")
  childRecord.put("city", "beijing")
  avroRecord.put("address", childRecord)
  avroRecord
}
```
It can also be built from an Avro schema file:
```scala
import java.nio.file.{Files, Paths}
import org.apache.avro.Schema
import org.apache.avro.generic.GenericData

val rdd = spark.sparkContext.parallelize(Seq(1, 1, 1, 1, 1, 1, 1))
val rdd2 = rdd.map { line =>
  val jsonFormatSchema = new String(Files.readAllBytes(Paths.get("src/main/resources/schema.avro")))
  val avroSchema = new Schema.Parser().parse(jsonFormatSchema)
  val datum = new GenericData.Record(avroSchema)
  datum.put("firstname", "xiao")
  datum.put("lastname", "ming")
  datum.put("age", 24)
  datum.put("test_field2", "test2")
  val childSchema = avroSchema.getField("address").schema()
  val childDatum = new GenericData.Record(childSchema)
  childDatum.put("streetaddress", "xierqi")
  childDatum.put("city", "beijing")
  datum.put("address", childDatum)
  datum.put("test_field", 222)
  datum
}
```
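The AvroKeyOutputFormat rows in the results table were produced by writing such records through the Hadoop output format. A hedged sketch of that write from Spark; the output path is illustrative, and the schema is assumed to be parsed once on the driver as well:

```scala
import java.nio.file.{Files, Paths}
import org.apache.avro.Schema
import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.{AvroJob, AvroKeyOutputFormat}
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapreduce.Job

// Parse the schema on the driver (assumption: same schema file as above)
val driverSchema = new Schema.Parser().parse(
  new String(Files.readAllBytes(Paths.get("src/main/resources/schema.avro"))))

val job = Job.getInstance(spark.sparkContext.hadoopConfiguration)
AvroJob.setOutputKeySchema(job, driverSchema)
// For the SNAPPY row, request Avro's container-file codec
job.getConfiguration.set("avro.output.codec", "snappy")

rdd2.map(record => (new AvroKey[GenericRecord](record), NullWritable.get()))
  .saveAsNewAPIHadoopFile(
    "/user/hive/warehouse/avro_out", // illustrative path
    classOf[AvroKey[GenericRecord]],
    classOf[NullWritable],
    classOf[AvroKeyOutputFormat[GenericRecord]],
    job.getConfiguration)
```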
Saving a Spark RDD as an LZO-compressed text file:

```scala
scala> import com.hadoop.compression.lzo.LzopCodec
import com.hadoop.compression.lzo.LzopCodec

scala> df.rdd.saveAsTextFile("/user/hive/warehouse/json_lzo", classOf[LzopCodec])
```
Saving a Spark RDD as a Snappy-compressed text file:

```scala
scala> import org.apache.hadoop.io.compress.SnappyCodec
import org.apache.hadoop.io.compress.SnappyCodec

scala> df.repartition(1).rdd.saveAsTextFile("/user/hive/warehouse/json_snappy", classOf[SnappyCodec])
```
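For the SequenceFileOutputFormat rows in the results table below, a minimal sketch of one way to write the Thrift-serialized bytes into a compressed SequenceFile; the path is illustrative, and `thriftRdd` is assumed to be an RDD of the generated `test_serializer` objects:

```scala
import org.apache.hadoop.io.{BytesWritable, NullWritable}
import org.apache.hadoop.io.compress.DefaultCodec
import org.apache.thrift.TSerializer
import org.apache.thrift.protocol.TCompactProtocol

thriftRdd
  .map { obj =>
    // TSerializer is not serializable, so construct it inside the task
    val serializer = new TSerializer(new TCompactProtocol.Factory())
    (NullWritable.get(), new BytesWritable(serializer.serialize(obj)))
  }
  // DefaultCodec gives the TCompactProtocol+DefaultCodec row; omit it for the uncompressed row
  .saveAsSequenceFile("/user/hive/warehouse/thrift_compact_deflate", Some(classOf[DefaultCodec]))
```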
5. Test results
| Serialization framework | Output format | Compression / serialization | Rows | Files | File size |
| --- | --- | --- | --- | --- | --- |
| avro | AvroKeyOutputFormat | none | 5250987 | 1 | 587.9 MB |
| avro | AvroKeyOutputFormat | SNAPPY | 5250987 | 1 | 453.2 MB |
| avro | AvroParquetOutputFormat | SNAPPY | 5250987 | 1 | 553.7 MB |
| thrift | ParquetThriftOutputFormat | SNAPPY | 5250987 | 1 | 570.5 MB |
| thrift | SequenceFileOutputFormat | TBinaryProtocol | 5250987 | 1 | 1.19 GB |
| thrift | SequenceFileOutputFormat | TCompactProtocol | 5250987 | 1 | 788.7 MB |
| thrift | SequenceFileOutputFormat | TCompactProtocol + DefaultCodec | 5250987 | 1 | 487.1 MB |
| json | textfile | none | 5250987 | 1 | 1.84 GB |
| json | textfile | gzip | 5250987 | 1 | 570.8 MB |
| json | textfile | lzo | 5250987 | 1 | 716 MB |
| json | textfile | snappy | 5250987 | 1 | 727 MB |
This article is published only on cnblogs and tonglin0325's blog. Author: tonglin0325. Please credit the original link when reposting: https://www.cnblogs.com/tonglin0325/p/5463439.html