骑着蜗牛追火车

Official documentation: http://avro.apache.org/docs/current/

Introduction

Apache Avro™ is a data serialization system.

Avro provides:

  • Rich data structures.
  • A compact, fast, binary data format.
  • A container file, to store persistent data.
  • Remote procedure call (RPC).
  • Simple integration with dynamic languages. Code generation is not required to read or write data files nor to use or implement RPC protocols. Code generation is an optional optimization, only worth implementing for statically typed languages.

Schemas

Avro relies on schemas. When Avro data is read, the schema used when writing it is always present. This permits each datum to be written with no per-value overheads, making serialization both fast and small. This also facilitates use with dynamic, scripting languages, since data, together with its schema, is fully self-describing.

When Avro data is stored in a file, its schema is stored with it, so that files may be processed later by any program. If the program reading the data expects a different schema this can be easily resolved, since both schemas are present.

When Avro is used in RPC, the client and server exchange schemas in the connection handshake. (This can be optimized so that, for most calls, no schemas are actually transmitted.) Since the client and server each have the other's full schema, correspondence between same-named fields, missing fields, extra fields, etc. can all be easily resolved.

Avro schemas are defined with JSON. This facilitates implementation in languages that already have JSON libraries.
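The name-based resolution described above can be sketched in a few lines. This is a simplified illustration using only the standard library, not the Avro library's own resolver: `resolve_record` and the `email` field are hypothetical, and real Avro resolution also handles aliases, type promotion, and nested types.

```python
import json

# Schema the data was written with.
WRITER_SCHEMA = json.loads("""
{"type": "record", "name": "User", "namespace": "example.avro",
 "fields": [
     {"name": "name", "type": "string"},
     {"name": "favorite_number", "type": ["int", "null"]}
 ]}
""")

# Hypothetical newer schema: drops favorite_number, adds email with a default.
READER_SCHEMA = json.loads("""
{"type": "record", "name": "User", "namespace": "example.avro",
 "fields": [
     {"name": "name", "type": "string"},
     {"name": "email", "type": "string", "default": "unknown"}
 ]}
""")


def resolve_record(datum, writer_schema, reader_schema):
    """Project a record written with writer_schema onto reader_schema by field name."""
    written = {f["name"] for f in writer_schema["fields"]}
    result = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in written:
            result[name] = datum[name]        # same-named field: carry the value over
        elif "default" in field:
            result[name] = field["default"]   # missing in the writer: use the reader's default
        else:
            raise ValueError("no value or default for field %r" % name)
    return result                             # fields only the writer knows are dropped


old_datum = {"name": "Alyssa", "favorite_number": 256}
resolved = resolve_record(old_datum, WRITER_SCHEMA, READER_SCHEMA)
print(resolved)
```

Because both schemas are available, no field numbers are needed: the field names alone decide what is copied, defaulted, or dropped.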


Comparison with other systems

Avro provides functionality similar to systems such as Thrift, Protocol Buffers, etc. Avro differs from these systems in the following fundamental aspects.

  • Dynamic typing: Avro does not require that code be generated. Data is always accompanied by a schema that permits full processing of that data without code generation, static datatypes, etc. This facilitates construction of generic data-processing systems and languages.
  • Untagged data: Since the schema is present when data is read, considerably less type information need be encoded with data, resulting in smaller serialization size.
  • No manually-assigned field IDs: When a schema changes, both the old and new schema are always present when processing data, so differences may be resolved symbolically, using field names.
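To make the "untagged data" point concrete, here is a rough sketch of how a User record (name, favorite_number, favorite_color) could be laid out in Avro's binary encoding, based on the specification's rules for longs (zig-zag, variable-length), strings (length followed by UTF-8 bytes), and unions (branch index followed by the value). The helper names are hypothetical and this is not the Avro library's encoder; it only handles the types needed for this one record.

```python
def encode_long(n):
    """Zig-zag, then variable-length encode an int or long."""
    z = (n << 1) ^ (n >> 63)            # zig-zag: small magnitudes become small unsigned values
    out = bytearray()
    while z & ~0x7F:                    # more than 7 bits left: emit a continuation byte
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_string(s):
    """A string is its UTF-8 length (as a long) followed by the UTF-8 bytes."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data


def encode_optional(value, encode_value):
    """Encode a [T, "null"] union: branch index, then the value (null has no bytes)."""
    if value is None:
        return encode_long(1)           # branch 1 is "null"
    return encode_long(0) + encode_value(value)


def encode_user(user):
    """Fields are written in schema order, with no names or tags in the output."""
    return (encode_string(user["name"])
            + encode_optional(user.get("favorite_number"), encode_long)
            + encode_optional(user.get("favorite_color"), encode_string))


ben = encode_user({"name": "Ben", "favorite_number": 7, "favorite_color": "red"})
print(ben.hex())   # 0642656e000e0006726564
print(len(ben))    # 11
```

The field names never appear in the 11 encoded bytes; a reader can only make sense of them with the schema in hand, which is why Avro always keeps the schema alongside the data.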


 

2. Schema

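Schemas are expressed in JSON. A schema defines primitive and complex data types, and each complex type carries its own attributes. By combining these types, users can define rich data structures.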

The primitive types are:

Type      Description
null      no value
boolean   a binary value
int       32-bit signed integer
long      64-bit signed integer
float     single precision (32-bit) IEEE 754 floating-point number
double    double precision (64-bit) IEEE 754 floating-point number
bytes     sequence of 8-bit unsigned bytes
string    unicode character sequence
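As a rough illustration of how these primitive types line up with Python values, the sketch below checks a value against a primitive type name. The `fits` helper and the exact checks are assumptions made for this example (written for Python 3); the Avro library's own validation may accept a slightly different set of values.

```python
INT_MIN, INT_MAX = -2**31, 2**31 - 1
LONG_MIN, LONG_MAX = -2**63, 2**63 - 1


def _is_integer(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


PRIMITIVE_CHECKS = {
    "null": lambda v: v is None,
    "boolean": lambda v: isinstance(v, bool),
    "int": lambda v: _is_integer(v) and INT_MIN <= v <= INT_MAX,
    "long": lambda v: _is_integer(v) and LONG_MIN <= v <= LONG_MAX,
    "float": lambda v: isinstance(v, float),
    "double": lambda v: isinstance(v, float),
    "bytes": lambda v: isinstance(v, bytes),
    "string": lambda v: isinstance(v, str),
}


def fits(type_name, value):
    """Return True if value is acceptable for the named primitive type."""
    return PRIMITIVE_CHECKS[type_name](value)


print(fits("int", 7))        # True
print(fits("int", 2**31))    # False: outside the 32-bit signed range
print(fits("long", 2**31))   # True
```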
 

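Avro defines six complex types:

  • Record: a named collection of fields of arbitrary types, represented as a JSON object. It supports the following attributes:
    • name: the record name (required)
    • namespace
    • doc
    • aliases
    • fields: a JSON array of field objects (required); each field supports:
      • name
      • doc
      • type
      • default
      • order
      • aliases
  • Enum: supports the following attributes:
    • name: the enum name (required)
    • namespace
    • doc
    • aliases
    • symbols: the enum values (required)
  • Array: an ordered sequence of items that all share the same schema. It supports the following attribute:
    • items
  • Map: an unordered set of key/value pairs. Keys must be strings; values may be of any type but must all share the same schema. It supports the following attribute:
    • values
  • Fixed: a fixed number of 8-bit unsigned bytes. It supports the following attributes:
    • name: the type name (required)
    • namespace
    • size: the number of bytes per value
    • aliases
  • Union: a union of schemas, represented as a JSON array in which each element is a schema.

Each complex type has its own attributes, some required and some optional.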
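The sketch below builds one schema that uses each of the six complex types once and prints it as JSON. The `Player` record and all of its field names are made up for illustration; only the attribute names (`type`, `name`, `symbols`, `items`, `values`, `size`, `default`) come from the lists above.

```python
import json

# Hypothetical schema: every name here is illustrative only.
PLAYER_SCHEMA = {
    "type": "record",
    "name": "Player",
    "namespace": "example.avro",
    "fields": [
        {"name": "nickname", "type": "string"},
        {"name": "suit",
         "type": {"type": "enum", "name": "Suit",
                  "symbols": ["SPADES", "HEARTS", "DIAMONDS", "CLUBS"]}},
        {"name": "scores", "type": {"type": "array", "items": "long"}},
        {"name": "attributes", "type": {"type": "map", "values": "string"}},
        {"name": "checksum", "type": {"type": "fixed", "name": "MD5", "size": 16}},
        # A union is a JSON array; the default matches its first branch ("null").
        {"name": "email", "type": ["null", "string"], "default": None},
    ],
}


def complex_type_of(schema_type):
    """Name the complex type of a field's type, or None for a primitive."""
    if isinstance(schema_type, list):
        return "union"
    if isinstance(schema_type, dict):
        return schema_type["type"]
    return None


schema_json = json.dumps(PLAYER_SCHEMA, indent=2)
used = ["record"] + [complex_type_of(f["type"]) for f in PLAYER_SCHEMA["fields"]
                     if complex_type_of(f["type"]) is not None]
print(schema_json)
print(sorted(used))
```

Saved to a file, the printed JSON would play the same role as the user.avsc file used later in this guide.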

Download

Avro implementations for C, C++, C#, Java, PHP, Python, and Ruby can be downloaded from the Apache Avro™ Releases page. This guide uses Avro 1.9.2, the latest version at the time of writing. Download and unzip avro-1.9.2.tar.gz, and install via python setup.py (this will probably require root privileges). Ensure that you can import avro from a Python prompt.

Note: pycodestyle must be installed before the avro package:

$ pip install pycodestyle

$ tar xvf avro-1.9.2.tar.gz
$ cd avro-1.9.2
$ sudo python setup.py install
$ python
>>> import avro # should not raise ImportError

Alternatively, install the package with pip (Python 2.7):

$ pip install avro

To install pip for Python 2.7:

# First fetch the get-pip.py file
$ wget https://bootstrap.pypa.io/get-pip.py

# Then run it
$ python ./get-pip.py

Alternatively, you may build the Avro Python library from source. From the root Avro directory, run the commands

$ cd lang/py/
$ ant
$ sudo python setup.py install
$ python
>>> import avro # should not raise ImportError
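The import check can also be scripted. This is a small sketch using only the standard library; `is_importable` is a hypothetical helper name, and the result for "avro" depends on whether the installation above succeeded.

```python
import importlib


def is_importable(module_name):
    """Return True if the named module can be imported in this interpreter."""
    try:
        importlib.import_module(module_name)
        return True
    except ImportError:
        return False


print(is_importable("json"))   # True: part of the standard library
print(is_importable("avro"))   # True only if the avro package is installed
```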
      

Defining a schema

Avro schemas are defined using JSON. Schemas are composed of primitive types (null, boolean, int, long, float, double, bytes, and string) and complex types (record, enum, array, map, union, and fixed).

You can learn more about Avro schemas and types from the specification, but for now let's start with a simple schema example, user.avsc:

{"namespace": "example.avro",
 "type": "record",
 "name": "User",
 "fields": [
     {"name": "name", "type": "string"},
     {"name": "favorite_number",  "type": ["int", "null"]},
     {"name": "favorite_color", "type": ["string", "null"]}
 ]
}

This schema defines a record representing a hypothetical user. (Note that a schema file can only contain a single schema definition.) At minimum, a record definition must include its type ("type": "record"), a name ("name": "User"), and fields, in this case name, favorite_number, and favorite_color. We also define a namespace ("namespace": "example.avro"), which together with the name attribute defines the "full name" of the schema (example.avro.User in this case).

Fields are defined via an array of objects, each of which defines a name and type (other attributes are optional, see the record specification for more details). The type attribute of a field is another schema object, which can be either a primitive or complex type. For example, the name field of our User schema is the primitive type string, whereas the favorite_number and favorite_color fields are both unions, represented by JSON arrays. unions are a complex type that can be any of the types listed in the array; e.g., favorite_number can either be an int or null, essentially making it an optional field.
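Since a schema is plain JSON, it can be inspected with nothing more than a JSON parser. The sketch below embeds the user.avsc text as a string so that it runs on its own, derives the full name, and lists the fields that are effectively optional; `is_optional` is a hypothetical helper, not part of the Avro library.

```python
import json

# Same content as user.avsc, embedded so the example is self-contained.
USER_AVSC = """
{"namespace": "example.avro",
 "type": "record",
 "name": "User",
 "fields": [
     {"name": "name", "type": "string"},
     {"name": "favorite_number",  "type": ["int", "null"]},
     {"name": "favorite_color", "type": ["string", "null"]}
 ]
}
"""

schema = json.loads(USER_AVSC)

# The full name is the namespace and the name joined by a dot.
full_name = "%s.%s" % (schema["namespace"], schema["name"])


def is_optional(field):
    """A field is effectively optional when its type is a union that includes null."""
    field_type = field["type"]
    return isinstance(field_type, list) and "null" in field_type


optional_fields = [f["name"] for f in schema["fields"] if is_optional(f)]

print(full_name)         # example.avro.User
print(optional_fields)   # ['favorite_number', 'favorite_color']
```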

 

Serializing and deserializing without code generation

Data in Avro is always stored with its corresponding schema, meaning we can always read a serialized item, regardless of whether we know the schema ahead of time. This allows us to perform serialization and deserialization without code generation. Note that the Avro Python library does not support code generation.

Try running the following code snippet, which serializes two users to a data file on disk, and then reads back and deserializes the data file:

import avro.schema
from avro.datafile import DataFileReader, DataFileWriter
from avro.io import DatumReader, DatumWriter

schema = avro.schema.parse(open("user.avsc", "rb").read())

writer = DataFileWriter(open("users.avro", "wb"), DatumWriter(), schema)
writer.append({"name": "Alyssa", "favorite_number": 256})
writer.append({"name": "Ben", "favorite_number": 7, "favorite_color": "red"})
writer.close()

reader = DataFileReader(open("users.avro", "rb"), DatumReader())
for user in reader:
    print user
reader.close()
      

This outputs:

{u'favorite_color': None, u'favorite_number': 256, u'name': u'Alyssa'}
{u'favorite_color': u'red', u'favorite_number': 7, u'name': u'Ben'}
      

Do make sure that you open your files in binary mode (i.e. using the modes wb or rb respectively). Otherwise you might generate corrupt files due to automatic replacement of newline characters with the platform-specific representations.
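The corruption can be reproduced without a real file. The sketch below (Python 3, standard library only) simulates a text-mode stream that uses Windows line endings and shows how a binary payload changes when written through it; the three payload bytes are arbitrary and merely stand in for Avro data that happens to contain the byte 0x0A.

```python
import io

# Arbitrary binary payload; the middle byte 0x0A is also the newline character.
payload = bytes([0x0C, 0x0A, 0x02])

raw = io.BytesIO()
# newline="\r\n" makes the wrapper translate "\n" on output, as Windows text mode does.
text_stream = io.TextIOWrapper(raw, encoding="latin-1", newline="\r\n")
text_stream.write(payload.decode("latin-1"))
text_stream.flush()

corrupted = raw.getvalue()
print(payload)     # b'\x0c\n\x02'
print(corrupted)   # b'\x0c\r\n\x02'
```

The extra 0x0D byte shifts everything after it, so a reader decoding length-prefixed binary data would misinterpret the rest of the file. Opening with wb and rb avoids the translation entirely.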

Let's take a closer look at what's going on here.

schema = avro.schema.parse(open("user.avsc", "rb").read())
      

avro.schema.parse takes a string containing a JSON schema definition as input and outputs an avro.schema.Schema object (specifically a subclass of Schema, in this case RecordSchema). We're passing in the contents of our user.avsc schema file here.

writer = DataFileWriter(open("users.avro", "wb"), DatumWriter(), schema)
      

We create a DataFileWriter, which we'll use to write serialized items to a data file on disk. The DataFileWriter constructor takes three arguments:

  • The file we'll serialize to
  • A DatumWriter, which is responsible for actually serializing the items to Avro's binary format.
  • The schema we're using. The DataFileWriter needs the schema both to write the schema to the data file, and to verify that the items we write are valid items and write the appropriate fields.

writer.append({"name": "Alyssa", "favorite_number": 256})
writer.append({"name": "Ben", "favorite_number": 7, "favorite_color": "red"})
        

We use DataFileWriter.append to add items to our data file. Avro records are represented as Python dicts. Since the field favorite_color has type ["string", "null"], we are not required to specify this field, as shown in the first append. Were we to omit the required name field, an exception would be raised. Any extra entries in the dict that do not correspond to a field are ignored.

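In other words, because the union types of "favorite_number" and "favorite_color" include null, neither field has to be supplied. If "name" is not given a value, an exception is raised, because the datum being written no longer matches the schema.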
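The rule for which dicts are accepted can be sketched as a small validator. This is an illustration written for Python 3, not the Avro library's own validation code: `matches` and `validate_user` are hypothetical helpers that only cover the three types used by the User schema, and ValueError stands in for whatever exception the library actually raises.

```python
USER_FIELDS = [
    {"name": "name", "type": "string"},
    {"name": "favorite_number", "type": ["int", "null"]},
    {"name": "favorite_color", "type": ["string", "null"]},
]


def matches(value, schema_type):
    """Return True if value fits a primitive type name or a union (list) of them."""
    if isinstance(schema_type, list):               # union: any one branch may match
        return any(matches(value, branch) for branch in schema_type)
    if schema_type == "null":
        return value is None
    if schema_type == "string":
        return isinstance(value, str)
    if schema_type == "int":
        return (isinstance(value, int) and not isinstance(value, bool)
                and -2**31 <= value <= 2**31 - 1)
    raise NotImplementedError(schema_type)


def validate_user(datum):
    """Check a dict against the User fields; a missing key is treated as None."""
    for field in USER_FIELDS:
        value = datum.get(field["name"])            # absent field -> None -> matches "null"
        if not matches(value, field["type"]):
            raise ValueError("field %r does not match %r" % (field["name"], field["type"]))
    return True                                     # keys not named in the schema are never looked at


print(validate_user({"name": "Alyssa", "favorite_number": 256}))   # True
print(validate_user({"name": "Ben", "nickname": "benny"}))         # True: extra key ignored
```

Calling validate_user({"favorite_number": 1}) raises ValueError, because the missing name becomes None and "string" has no null branch.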
 
reader = DataFileReader(open("users.avro", "rb"), DatumReader())
        

We open the file again, this time for reading back from disk. We use a DataFileReader and DatumReader analogous to the DataFileWriter and DatumWriter above.

for user in reader:
    print user
        

The DataFileReader is an iterator that returns dicts corresponding to the serialized items.

 
 
 
posted on 2020-05-27 10:53  骑着蜗牛追火车