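Avro Serialization and Deserialization in Detail, with Custom Utility Methods
1. Preface
Avro's Java serialization API comes in two main forms: SpecificDatumWriter / SpecificDatumReader, and DataFileWriter / DataFileReader. The latter wraps the former. The characteristics of each are described below.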
2. SpecificDatumWriter / SpecificDatumReader
2.1 Serialization with SpecificDatumWriter
Serialize one or more records with SpecificDatumWriter:
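    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.util.List;
    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.BinaryEncoder;
    import org.apache.avro.io.DatumWriter;
    import org.apache.avro.io.EncoderFactory;
    import org.apache.avro.specific.SpecificDatumWriter;

    public static ByteArrayOutputStream serializePrimary(Schema schema, List<GenericRecord> records) throws IOException {
        DatumWriter<GenericRecord> datumWriter = new SpecificDatumWriter<GenericRecord>(schema);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);

        // Call write() once per record
        for (GenericRecord record : records) {
            datumWriter.write(record, encoder);
        }
        // Flush once at the end so all buffered bytes reach the stream
        encoder.flush();
        return out;
    }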
2.2 Deserialization with SpecificDatumReader
Deserialize one or more records with SpecificDatumReader:
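    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.BinaryDecoder;
    import org.apache.avro.io.DatumReader;
    import org.apache.avro.io.DecoderFactory;
    import org.apache.avro.specific.SpecificDatumReader;

    public static List<GenericRecord> deserializeMulPrimary(Schema schema, ByteArrayOutputStream out) throws IOException {
        DatumReader<GenericRecord> datumReader = new SpecificDatumReader<GenericRecord>(schema);
        BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);

        List<GenericRecord> records = new ArrayList<GenericRecord>();
        // Keep reading until the end of the byte stream; isEnd() avoids
        // relying on EOFException to terminate the loop
        while (!decoder.isEnd()) {
            records.add(datumReader.read(null, decoder));
        }
        return records;
    }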
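2.3 Characteristics
a The serialized output contains no schema information
b Deserialization requires the schema to be supplied (because the serialized records do not carry it)
c Memory is the usual storage medium
d One or more records can be serialized and deserialized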
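To make points a and b concrete, here is a minimal hand-rolled sketch of how a record ends up as bytes in Avro's binary encoding. It is not Avro's own implementation: the class name SchemalessEncodingSketch and the record layout {name: string, age: int} are assumptions made up for illustration. It relies only on what the Avro specification states: int and long values are written as zigzag variable-length integers, a string is a length followed by its UTF-8 bytes, and a record is just its fields concatenated in schema order.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Illustrative sketch only; real code should use Avro's BinaryEncoder/BinaryDecoder.
final class SchemalessEncodingSketch {

    // Zigzag + variable-length encoding, as the Avro spec describes for int/long.
    static void writeLong(long n, ByteArrayOutputStream out) {
        long z = (n << 1) ^ (n >> 63);            // zigzag: small magnitudes -> small codes
        while ((z & ~0x7FL) != 0) {
            out.write((int) ((z & 0x7F) | 0x80)); // low 7 bits, continuation bit set
            z >>>= 7;
        }
        out.write((int) z);
    }

    // A string is its UTF-8 byte length followed by the bytes themselves.
    static void writeString(String s, ByteArrayOutputStream out) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        writeLong(utf8.length, out);
        out.write(utf8, 0, utf8.length);
    }

    static long readLong(ByteBuffer in) {
        long z = 0;
        int shift = 0;
        int b;
        do {
            b = in.get() & 0xFF;
            z |= (long) (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return (z >>> 1) ^ -(z & 1);              // undo zigzag
    }

    static String readString(ByteBuffer in) {
        byte[] utf8 = new byte[(int) readLong(in)];
        in.get(utf8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    // Hypothetical record {name: string, age: int}: fields in schema order,
    // with no field names or type tags written anywhere.
    static byte[] encodeUser(String name, int age) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeString(name, out);
        writeLong(age, out);
        return out.toByteArray();
    }

    // The reader has to know the field order and types in advance,
    // which is exactly the role the schema plays.
    static String decodeUser(byte[] data) {
        ByteBuffer in = ByteBuffer.wrap(data);
        String name = readString(in);
        long age = readLong(in);
        return name + ":" + age;
    }
}
```

For example, encodeUser("Al", 30) produces the four bytes 04 41 6C 3C: a length, two characters, and one integer. Nothing in those bytes says "name" or "age", so a reader without the schema cannot tell where one field ends and the next begins.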
3. DataFileWriter / DataFileReader
3.1 Serialization with DataFileWriter
Serialize data into memory:
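    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.util.List;
    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.DatumWriter;
    import org.apache.avro.specific.SpecificDatumWriter;

    public static ByteArrayOutputStream serializeToMemory(Schema schema, List<GenericRecord> records) throws IOException {
        DatumWriter<GenericRecord> datumWriter = new SpecificDatumWriter<GenericRecord>(schema);
        ByteArrayOutputStream out = new ByteArrayOutputStream();

        // try-with-resources closes (and flushes) the writer even if append() throws
        try (DataFileWriter<GenericRecord> fileWriter = new DataFileWriter<GenericRecord>(datumWriter)) {
            // First write the schema into memory
            fileWriter.create(schema, out);
            // Then append the GenericRecord entries
            for (GenericRecord record : records) {
                fileWriter.append(record);
            }
        }
        return out;
    }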
Serialize data into an Avro file:
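    import java.io.File;
    import java.io.IOException;
    import java.util.List;
    import java.util.UUID;
    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.DatumWriter;
    import org.apache.avro.specific.SpecificDatumWriter;

    public static File serializeToFile(Schema schema, List<GenericRecord> records, String fileDirectory) throws IOException {
        DatumWriter<GenericRecord> datumWriter = new SpecificDatumWriter<GenericRecord>(schema);
        // Timestamp plus UUID keeps generated file names unique
        File file = new File(fileDirectory, System.currentTimeMillis() + "-" + UUID.randomUUID() + ".avro");

        // try-with-resources closes the writer even if append() throws
        try (DataFileWriter<GenericRecord> fileWriter = new DataFileWriter<GenericRecord>(datumWriter)) {
            // First write the schema into the file
            fileWriter.create(schema, file);
            // Then append the GenericRecord entries
            for (GenericRecord record : records) {
                fileWriter.append(record);
            }
        }
        return file;
    }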
3.2 Deserialization with DataFileReader
Deserialize one or more records from memory:
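    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.avro.file.DataFileReader;
    import org.apache.avro.file.SeekableByteArrayInput;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.DatumReader;
    import org.apache.avro.specific.SpecificDatumReader;

    public static List<GenericRecord> deserializeFromMemory(ByteArrayOutputStream out) throws IOException {
        // No schema is passed in: the reader takes it from the data header
        DatumReader<GenericRecord> datumReader = new SpecificDatumReader<GenericRecord>();
        SeekableByteArrayInput sin = new SeekableByteArrayInput(out.toByteArray());
        List<GenericRecord> records = new ArrayList<GenericRecord>();
        try (DataFileReader<GenericRecord> fileReader = new DataFileReader<GenericRecord>(sin, datumReader)) {
            while (fileReader.hasNext()) {
                records.add(fileReader.next());
            }
        }
        return records;
    }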
Deserialize one or more records from an Avro file:
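    import java.io.File;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.avro.file.DataFileReader;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.DatumReader;
    import org.apache.avro.specific.SpecificDatumReader;

    public static List<GenericRecord> deserializeFromFile(File file) throws IOException {
        // No schema is passed in: the reader takes it from the file header
        DatumReader<GenericRecord> datumReader = new SpecificDatumReader<GenericRecord>();
        List<GenericRecord> records = new ArrayList<GenericRecord>();
        try (DataFileReader<GenericRecord> fileReader = new DataFileReader<GenericRecord>(file, datumReader)) {
            while (fileReader.hasNext()) {
                records.add(fileReader.next());
            }
        }
        return records;
    }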
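3.3 Characteristics
a The serialized output contains the schema
b Deserialization does not need the schema to be supplied separately, because the serialized content already includes it
c Either memory or a file can serve as the storage medium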
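Because the container format written by DataFileWriter starts with a header that carries the schema, the two kinds of output can be told apart cheaply. The Avro specification says an object container file begins with the four bytes 'O', 'b', 'j', 1. The sketch below checks only for that marker; the class name AvroContainerSniffer and the method name looksLikeContainer are hypothetical helpers, not part of the Avro library.

```java
// Illustrative helper only; it does not parse or validate the rest of the header.
final class AvroContainerSniffer {

    // Magic bytes at the start of an Avro object container file: 'O', 'b', 'j', 1.
    private static final byte[] MAGIC = {'O', 'b', 'j', 1};

    // Returns true if the data starts with the container magic bytes.
    static boolean looksLikeContainer(byte[] data) {
        if (data == null || data.length < MAGIC.length) {
            return false;
        }
        for (int i = 0; i < MAGIC.length; i++) {
            if (data[i] != MAGIC[i]) {
                return false;
            }
        }
        return true;
    }
}
```

A caller might use such a check to decide whether a byte array can be handed to deserializeFromMemory, or whether it is raw binary that needs a schema and deserializeMulPrimary instead. A raw record could in principle start with the same four bytes by coincidence, so this is a heuristic rather than a guarantee.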