Avro Serialization and Deserialization in Detail, with Custom Utility Methods

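1. Preface

Avro's Java API offers two main ways to serialize data: SpecificDatumWriter / SpecificDatumReader, and DataFileWriter / DataFileReader. The latter wraps the former: a DataFileWriter delegates the encoding of each record to a DatumWriter and adds a container format around the result. The characteristics of each pair are described below.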

2. SpecificDatumWriter / SpecificDatumReader

2.1 Serializing with SpecificDatumWriter

Serialize one or more records with SpecificDatumWriter:

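import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.List;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.io.EncoderFactory;
import org.apache.avro.specific.SpecificDatumWriter;

public static ByteArrayOutputStream serializePrimary(Schema schema, List<GenericRecord> records) throws IOException {
    DatumWriter<GenericRecord> datumWriter = new SpecificDatumWriter<GenericRecord>(schema);
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);

    // Call write() once per record; records are written back to back
    for (GenericRecord record : records) {
        datumWriter.write(record, encoder);
    }
    // The encoder is buffered, so flush once before handing back the stream
    encoder.flush();
    return out;
}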

2.2 Deserializing with SpecificDatumReader

Deserialize one or more records with SpecificDatumReader:

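import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.DatumReader;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.specific.SpecificDatumReader;

public static List<GenericRecord> deserializeMulPrimary(Schema schema, ByteArrayOutputStream out) throws IOException {
    DatumReader<GenericRecord> datumReader = new SpecificDatumReader<GenericRecord>(schema);
    BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);

    List<GenericRecord> records = new ArrayList<GenericRecord>();
    // Stop at the end of the byte stream; isEnd() avoids relying on EOFException for control flow
    while (!decoder.isEnd()) {
        records.add(datumReader.read(null, decoder));
    }
    return records;
}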
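
A complete round trip may make the two methods above easier to try out. The following is a minimal sketch, not the author's code: the `User` schema, the class name `RawBinaryRoundTrip`, and its helper methods are assumptions made for illustration. It uses GenericDatumWriter / GenericDatumReader (the parent classes of the Specific variants) because no generated classes are available in a standalone example.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class RawBinaryRoundTrip {
    // Hypothetical schema, used only for this illustration
    static final Schema SCHEMA = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
        + "{\"name\":\"name\",\"type\":\"string\"},"
        + "{\"name\":\"age\",\"type\":\"int\"}]}");

    static GenericRecord user(String name, int age) {
        GenericRecord record = new GenericData.Record(SCHEMA);
        record.put("name", name);
        record.put("age", age);
        return record;
    }

    // Writes all records back to back into one byte array
    static byte[] encode(List<GenericRecord> records) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        GenericDatumWriter<GenericRecord> writer = new GenericDatumWriter<>(SCHEMA);
        for (GenericRecord record : records) {
            writer.write(record, encoder);
        }
        encoder.flush();
        return out.toByteArray();
    }

    // Reads records until the decoder reports the end of its input
    static List<GenericRecord> decode(byte[] bytes) throws IOException {
        BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(bytes, null);
        GenericDatumReader<GenericRecord> reader = new GenericDatumReader<>(SCHEMA);
        List<GenericRecord> records = new ArrayList<>();
        while (!decoder.isEnd()) {
            records.add(reader.read(null, decoder));
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        byte[] bytes = encode(Arrays.asList(user("alice", 30), user("bob", 41)));
        List<GenericRecord> back = decode(bytes);
        System.out.println(bytes.length + " bytes, " + back.size() + " records");
        System.out.println(back.get(0).get("name") + ", " + back.get(1).get("age"));
    }
}
```

With this schema the two records occupy 12 bytes in total, since only the field values are written.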

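2.3 Characteristics

a. The serialized output contains no schema information.
b. A schema must be supplied when deserializing, because the serialized records do not carry one. It should be the schema the data was written with.
c. It is mainly used with in-memory byte streams.
d. One or more records can be serialized and deserialized.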
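
To see why the reader cannot do without a schema, it can help to look at what the raw bytes contain. The sketch below re-implements by hand, without the Avro library, the encoding that the Avro specification defines for int/long values (zigzag plus variable-length integer) and strings (length prefix plus UTF-8 bytes). The class `AvroBytesByHand`, its methods, and the two-field `User` layout (`name: string`, `age: int`) are hypothetical and exist only for this illustration.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class AvroBytesByHand {
    // Zigzag-encode the value, then emit it 7 bits at a time, low bits first;
    // the high bit of each byte marks that more bytes follow
    static void writeLong(ByteArrayOutputStream out, long n) {
        long z = (n << 1) ^ (n >> 63);
        while ((z & ~0x7FL) != 0) {
            out.write((int) ((z & 0x7F) | 0x80));
            z >>>= 7;
        }
        out.write((int) z);
    }

    // A string is its UTF-8 byte length (as a long) followed by the bytes
    static void writeString(ByteArrayOutputStream out, String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        writeLong(out, utf8.length);
        out.write(utf8, 0, utf8.length);
    }

    // A record is just its field values in schema order: no names, no types
    static byte[] encodeUser(String name, int age) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeString(out, name);
        writeLong(out, age);
        return out.toByteArray();
    }

    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints 0a616c6963653c
        System.out.println(hex(encodeUser("alice", 30)));
    }
}
```

The seven bytes are `0a` (length 5), the five letters of "alice", and `3c` (30). Nothing in them says that the first field is a string or that the second is an int, which is why the reader has to be told the schema.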

3. DataFileWriter / DataFileReader

3.1 Serializing with DataFileWriter

Serialize data into memory:

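import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.List;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.specific.SpecificDatumWriter;

public static ByteArrayOutputStream serializeToMemory(Schema schema, List<GenericRecord> records) throws IOException {
    DatumWriter<GenericRecord> datumWriter = new SpecificDatumWriter<GenericRecord>(schema);
    ByteArrayOutputStream out = new ByteArrayOutputStream();

    // try-with-resources closes the writer (and flushes it) even if append() throws
    try (DataFileWriter<GenericRecord> fileWriter = new DataFileWriter<GenericRecord>(datumWriter)) {
        // First write the schema (the container header) into memory
        fileWriter.create(schema, out);
        // Then append the GenericRecord entries one by one
        for (GenericRecord record : records) {
            fileWriter.append(record);
        }
    }
    return out;
}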

Serialize data into an Avro file:

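import java.io.File;
import java.io.IOException;
import java.util.List;
import java.util.UUID;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.specific.SpecificDatumWriter;

public static File serializeToFile(Schema schema, List<GenericRecord> records, String fileDirectory) throws IOException {
    DatumWriter<GenericRecord> datumWriter = new SpecificDatumWriter<GenericRecord>(schema);
    // Timestamp plus UUID keeps file names unique within the directory
    File file = new File(fileDirectory, System.currentTimeMillis() + "-" + UUID.randomUUID() + ".avro");

    try (DataFileWriter<GenericRecord> fileWriter = new DataFileWriter<GenericRecord>(datumWriter)) {
        // First write the schema (the container header) to the file
        fileWriter.create(schema, file);
        // Then append the GenericRecord entries one by one
        for (GenericRecord record : records) {
            fileWriter.append(record);
        }
    }
    return file;
}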

3.2 Deserializing with DataFileReader

Deserialize one or more records from memory:

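import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.SeekableByteArrayInput;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumReader;
import org.apache.avro.specific.SpecificDatumReader;

public static List<GenericRecord> deserializeFromMemory(ByteArrayOutputStream out) throws IOException {
    // No schema is passed in: DataFileReader takes it from the container header
    DatumReader<GenericRecord> datumReader = new SpecificDatumReader<GenericRecord>();
    SeekableByteArrayInput sin = new SeekableByteArrayInput(out.toByteArray());
    List<GenericRecord> records = new ArrayList<GenericRecord>();
    try (DataFileReader<GenericRecord> fileReader = new DataFileReader<GenericRecord>(sin, datumReader)) {
        while (fileReader.hasNext()) {
            records.add(fileReader.next());
        }
    }
    return records;
}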
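
The reader above is constructed without a schema, so the schema has to come from the data itself. The sketch below illustrates this by writing one record into an in-memory container and then asking the reader for the schema it found there. The `User` schema and the class `EmbeddedSchemaDemo` are assumptions for illustration, and GenericDatumWriter / GenericDatumReader stand in for the Specific variants because no generated classes exist in a standalone example.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.file.SeekableByteArrayInput;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class EmbeddedSchemaDemo {
    // Hypothetical schema, used only for this illustration
    static final Schema SCHEMA = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
        + "{\"name\":\"name\",\"type\":\"string\"},"
        + "{\"name\":\"age\",\"type\":\"int\"}]}");

    // Writes a single record into an in-memory Avro container
    static byte[] writeContainer() throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(SCHEMA))) {
            writer.create(SCHEMA, out);
            GenericRecord record = new GenericData.Record(SCHEMA);
            record.put("name", "alice");
            record.put("age", 30);
            writer.append(record);
        }
        return out.toByteArray();
    }

    // Recovers the schema from the container; note that none is passed in
    static Schema readEmbeddedSchema(byte[] container) throws IOException {
        try (DataFileReader<GenericRecord> reader = new DataFileReader<>(
                 new SeekableByteArrayInput(container),
                 new GenericDatumReader<GenericRecord>())) {
            return reader.getSchema();
        }
    }

    public static void main(String[] args) throws IOException {
        Schema recovered = readEmbeddedSchema(writeContainer());
        System.out.println(recovered.toString(true));
    }
}
```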

Deserialize one or more records from an Avro file:

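import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumReader;
import org.apache.avro.specific.SpecificDatumReader;

public static List<GenericRecord> deserializeFromFile(File file) throws IOException {
    // No schema is passed in: DataFileReader takes it from the file header
    DatumReader<GenericRecord> datumReader = new SpecificDatumReader<GenericRecord>();
    List<GenericRecord> records = new ArrayList<GenericRecord>();
    try (DataFileReader<GenericRecord> fileReader = new DataFileReader<GenericRecord>(file, datumReader)) {
        while (fileReader.hasNext()) {
            records.add(fileReader.next());
        }
    }
    return records;
}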
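
For a quick end-to-end check of the file variant, the following sketch writes a few records to a temporary file and reads them back. It is an illustration under assumptions rather than the author's utility class: the one-field `User` schema, the class `AvroFileRoundTrip`, and the method `roundTripNames` are hypothetical. It also relies on DataFileReader being iterable, so a for-each loop can replace the explicit hasNext() / next() calls.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroFileRoundTrip {
    // Hypothetical schema, used only for this illustration
    static final Schema SCHEMA = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
        + "{\"name\":\"name\",\"type\":\"string\"}]}");

    // Writes one record per name to a temp file, reads them back, deletes the file
    static List<String> roundTripNames(List<String> names) throws IOException {
        File file = Files.createTempFile("users-", ".avro").toFile();
        try {
            try (DataFileWriter<GenericRecord> writer =
                     new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(SCHEMA))) {
                writer.create(SCHEMA, file);
                for (String name : names) {
                    GenericRecord record = new GenericData.Record(SCHEMA);
                    record.put("name", name);
                    writer.append(record);
                }
            }
            List<String> result = new ArrayList<>();
            try (DataFileReader<GenericRecord> reader =
                     new DataFileReader<>(file, new GenericDatumReader<GenericRecord>())) {
                for (GenericRecord record : reader) {
                    // String fields come back as Avro's Utf8 type, so convert explicitly
                    result.add(record.get("name").toString());
                }
            }
            return result;
        } finally {
            file.delete();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(roundTripNames(Arrays.asList("alice", "bob")));
    }
}
```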

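3.3 Characteristics

a. The serialized output contains the schema.
b. No schema needs to be supplied when deserializing, because the serialized content already contains it.
c. Either memory or a file can serve as the storage medium.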
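
The difference between the two approaches shows up directly in the bytes. As a hedged illustration, the sketch below encodes the same record both ways and inspects the results: the container output starts with the 4-byte magic `Obj` followed by the version byte 1, and its header metadata stores the schema under the key `avro.schema`. The `User` schema, the class `ContainerHeaderPeek`, and its helper methods are assumptions made for this example.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

public class ContainerHeaderPeek {
    // Hypothetical schema, used only for this illustration
    static final Schema SCHEMA = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
        + "{\"name\":\"name\",\"type\":\"string\"},"
        + "{\"name\":\"age\",\"type\":\"int\"}]}");

    static GenericRecord sample() {
        GenericRecord record = new GenericData.Record(SCHEMA);
        record.put("name", "alice");
        record.put("age", 30);
        return record;
    }

    // Raw binary encoding: field values only
    static byte[] rawBinary(GenericRecord record) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(SCHEMA).write(record, encoder);
        encoder.flush();
        return out.toByteArray();
    }

    // Container encoding: header with the schema, followed by data blocks
    static byte[] container(GenericRecord record) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(SCHEMA))) {
            writer.create(SCHEMA, out);
            writer.append(record);
        }
        return out.toByteArray();
    }

    static boolean startsWithMagic(byte[] bytes) {
        return bytes.length >= 4
            && bytes[0] == 'O' && bytes[1] == 'b' && bytes[2] == 'j' && bytes[3] == 1;
    }

    static boolean mentionsSchema(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1).contains("avro.schema");
    }

    public static void main(String[] args) throws IOException {
        byte[] raw = rawBinary(sample());
        byte[] file = container(sample());
        System.out.println("raw: " + raw.length + " bytes, schema inside: " + mentionsSchema(raw));
        System.out.println("container: " + file.length + " bytes, magic: "
            + startsWithMagic(file) + ", schema inside: " + mentionsSchema(file));
    }
}
```

For a single small record the container is much larger than the 7 raw bytes, because the header is paid once per file. The overhead amortizes as more records are appended, which is one reason the raw form suits short in-memory messages and the container form suits files.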

posted @ 2017-11-14 16:46  simple-clean-opt