gRPC Interface Testing with Python (Part 3)
I. Scalar Value Types
Scalar value types are conceptually similar to the basic data types found in most programming languages, and they carry much the same kind of data. Each scalar value type has a corresponding Python data type, which makes these fields straightforward to work with.
Usage example:
message Student {
  string name = 1;
  int32 age = 2;
  bool sex = 3;
}
Python code:
name="小王"
age=15
sex=True
#方式1
student=Student(name=name,age=age,sex=sex)
#方式2
student=Student()
student.name=name
student.age=age
student.sex=sex
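Whichever way the message is built, it can then be serialized to bytes and parsed back, which is what actually travels over the gRPC connection. A minimal sketch, assuming Student comes from the generated module (e.g. student_pb2):
# Message -> binary wire format
payload = student.SerializeToString()
# Binary wire format -> message
parsed = Student()
parsed.ParseFromString(payload)
print(parsed.name, parsed.age, parsed.sex)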
II. Some Special Types
Besides the scalar value types above, proto3 defines a number of special data types that make it convenient to build and pass around more complex data structures. The official documentation provides a JSON mapping table for these types, which gives an intuitive view of the basic structure of the data each type holds.
These types let us construct all kinds of data shapes. A few of them come up particularly often in the testing work described here, so the following sections introduce them with real-world examples.
1. message
As the JSON mapping table shows, a message is similar to an object of a class in a programming language. Just as a class can contain fields of various other types, including other classes, a message can contain fields of any proto type, including other messages. As its name suggests, message is the core type in protobuf: in a gRPC interface, data exchange happens precisely by sending and receiving messages.
A simple message:
message Person {
  int32 id = 1;
  string name = 2;
  string email = 3;
}
A message that contains another message:
message Point {
  int32 latitude = 1;
  int32 longitude = 2;
}
message Feature {
  string name = 1;
  Point location = 2;
}
Usage in Python:
location=Point(latitude=5,longitude=10)
feature=Feature(name="a feature name",location=location)
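Note that a nested message field on an existing instance cannot be replaced by plain assignment; it has to be filled with CopyFrom or by setting its leaf fields. A short sketch using the messages above:
feature = Feature(name="a feature name")
# feature.location = location  would raise an AttributeError
feature.location.CopyFrom(location)
# or set the nested fields directly
feature.location.latitude = 5
feature.location.longitude = 10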
2. Timestamp and Duration
Both of these types deal with time: Timestamp represents a point in time, while Duration represents a length of time.
In testing the AI platform's account service, an Account message was defined along the following lines (the update_at and time_limit fields use the well-known Timestamp and Duration types):
message Account {
  string account_id = 1;
  google.protobuf.Timestamp update_at = 2;
  google.protobuf.Duration time_limit = 3;
}
Usage in Python:
from google.protobuf.timestamp_pb2 import Timestamp
from google.protobuf.duration_pb2 import Duration

update_at=Timestamp()
# Set from an RFC 3339 date string
update_at.FromJsonString("1970-01-01T00:00:00Z")
# Or set to the current time
update_at.GetCurrentTime()
time_limit=Duration()
# Set from nanoseconds
time_limit.FromNanoseconds(1999999999)
# Or set from seconds
time_limit.FromSeconds(100)
account=Account(account_id="account1",update_at=update_at,time_limit=time_limit)
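Conversely, when a response carries these well-known types, they can be converted back to native Python values. A minimal sketch using methods the Timestamp and Duration classes provide:
dt = update_at.ToDatetime()          # datetime.datetime
iso_str = update_at.ToJsonString()   # e.g. "1970-01-01T00:00:00Z"
seconds = time_limit.ToSeconds()     # int
delta = time_limit.ToTimedelta()     # datetime.timedelta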
3. Any
Any is a special type: it can hold messages of different types. Combined with Pack and Unpack, a single Any field can carry messages of various types without declaring a separate field for each one.
In the conference simultaneous-interpretation project, one request message needed to carry two kinds of data, images and audio, so a single Any field is reused for both (in the Python code below this field is called body):
message ImageData {
  string index = 1;
  bytes image = 2;
}
message Data {
  string appid = 1;
  bytes payload = 2;
  string extra = 3;
}
message Request {
  google.protobuf.Any body = 1;
}
Usage in Python:
imageData=msg_pb2.ImageData(index="001",image=open("1.jpg","rb").read())
req1=msg_pb2.Request()
req1.body.Pack(imageData)
data=msg_pb2.Data(appid="no.1",payload=open("1.wav","rb").read(),extra="no use")
req=msg_pb2.Request()
req.body.Pack(data)
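On the receiving side, the concrete type held by the Any field can be checked and unpacked again. A sketch assuming the same msg_pb2 module:
received=msg_pb2.Request()
received.ParseFromString(req1.SerializeToString())
# Check which message type the Any field holds, then unpack it
if received.body.Is(msg_pb2.ImageData.DESCRIPTOR):
    image_data=msg_pb2.ImageData()
    received.body.Unpack(image_data)
    print(image_data.index)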
4. enum
An enum works like the enumeration type in most other programming languages: a fixed set of allowed values is defined up front, restricting what can be passed in the field.
In testing the AI platform's real-name verification service, a field for the type of person being verified was needed. Since the set of person types is small and fixed, an enum was used:
enum PersonType {
  PERSONTYPE_UNSPECIFIED = 0;
  INDIVIDUAL = 1;
  LEGAL = 2;
  AUTHORIZE = 3;
}
message Person {
  string real_name = 1;
  PersonType person_type = 2;
}
Usage in Python:
person_type=PersonType.Value("INDIVIDUAL")
person=Person(real_name="Xiao Wang",person_type=person_type)
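The generated enum wrapper also exposes the values as constants and can map numbers back to names. A small sketch based on the PersonType enum above:
person_type=PersonType.INDIVIDUAL      # same value as PersonType.Value("INDIVIDUAL")
print(PersonType.Name(person_type))    # -> "INDIVIDUAL"
print(PersonType.keys())               # all defined value names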
5. map
A map corresponds to an object of key-value pairs in JSON and is similar to a Python dict, so a Python dict can be used to populate it. In a .proto file a map is declared with angle brackets that specify the key and value types; map<string,string>, for example, means that both the key and the value are strings.
In the AI platform's authentication-related tests, each application created by a user needs several special attributes bound to it, each with its own value, so a map was used:
message App {
  string appid = 1;
  map<string, string> extra_informations = 2;
}
Usage in Python:
extra_informations={"name":"app1","expired":"no"}
app=App(appid="1234567", extra_informations=extra_informations)
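Like repeated fields (see section 6 and Part III below), a map field cannot be replaced by direct assignment once the message exists, but its entries can be set or merged individually. Roughly:
app=App(appid="1234567")
app.extra_informations["name"]="app1"
app.extra_informations["expired"]="no"
# or merge in a whole dict
app.extra_informations.update({"name":"app1","expired":"no"})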
6. repeated
A repeated field corresponds to a JSON list and is similar to a Python list, so a Python list can be used to populate it.
In testing the AI platform's account service, various capabilities need to be added to an account. Each capability has several attributes, and every capability shares the same attribute names and types, so a repeated field was used:
message Audience {
  string name = 1;
  string tier = 2;
}
message Account {
  string account_id = 1;
  repeated Audience audience = 2;
}
Usage in Python:
audience=[{"name":"ASR","tier":"stand"},{"name":"TTS","tier":"free"},{"name":"MT","tier":"stand"}]
account=Account(account_id="account1",audience=audience)
III. Practical Issues and Tips
1. Assignment to repeated fields
If the Python code from the repeated example above is changed to the following form, it raises an error at runtime:
audience=[{"name":"ASR","tier":"stand"},{"name":"TTS","tier":"free"},{"name":"MT","tier":"stand"}]
account=Account(account_id="account1")
account.audience=audience
Error message:
AttributeError: Assignment not allowed to repeated field "name" in protocol message object.
This seems to contradict the two assignment styles for messages described earlier. The reason is that protobuf's repeated type does not correspond exactly to a Python list, so direct assignment to a repeated field is not allowed. In practice, avoid this pattern and use the constructor style shown in the earlier example. Alternatively, the same result can be achieved by adding the elements one by one, as shown below:
audience=[{"name":"ASR","tier":"stand"},{"name":"TTS","tier":"free"},{"name":"MT","tier":"stand"}]
for audience1 in audience:
    a=account.audience.add()
    a.name=audience1['name']
    a.tier=audience1['tier']
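Two other equivalent ways to fill a repeated message field, sketched here for reference, are add() with keyword arguments and extend() with already-built messages:
# add() accepts the field values directly
account.audience.add(name="ASR",tier="stand")
# extend() appends copies of existing message objects
account.audience.extend([Audience(name="TTS",tier="free"),Audience(name="MT",tier="stand")])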
2. Building data for complex messages
In real interfaces, a message can be quite complex. Some speech-recognition interfaces, for example, have protocols containing many nested messages and repeated fields, which makes it harder to write test-client code and to construct and parse test cases. Passing parameters on the command line, as introduced earlier, clearly cannot handle this situation, and building such messages by hand is cumbersome. After some investigation, the Parse and MessageToJson methods in the json_format module of the protobuf library turn out to solve this nicely: they convert between protobuf messages and JSON. Since JSON is easy and flexible to work with, test cases can be written as JSON and turned into messages with Parse; after a response is received, MessageToJson converts the message back into JSON. To the tester, both the data sent and the data received look like JSON, which makes preparing test data and checking results much easier.
Example:
from google.protobuf import json_format

json_obj='{"a1":1,"a2":2}'
# JSON string -> protobuf message
request = json_format.Parse(json_obj,MessageName())
# protobuf message -> JSON string
json_result = json_format.MessageToJson(request)
print(json_result)
Here MessageName stands for the message class being constructed, and json_result is the message converted back to JSON.
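As a concrete illustration, here is a rough round trip reusing the Account and Audience messages from the repeated example above; note that json_format uses lowerCamelCase field names by default, and preserving_proto_field_name keeps the original names on output:
json_case='{"accountId":"account1","audience":[{"name":"ASR","tier":"stand"},{"name":"TTS","tier":"free"}]}'
account=json_format.Parse(json_case,Account())
print(json_format.MessageToJson(account,preserving_proto_field_name=True))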
https://developers.google.com/protocol-buffers/
https://developers.google.com/protocol-buffers/docs/pythontutorial
Protocol Buffer Basics: Go | Protocol Buffers | Google Developers
https://developers.google.com/protocol-buffers/docs/gotutorial
First-level exploration:
1. python --- protocol-buffers --- golang
Protocol buffers are a language-neutral, platform-neutral extensible mechanism for serializing structured data.
Protocol Buffers - Google's data interchange format
Protocol Buffer Basics: Go | Protocol Buffers | Google Developers
https://developers.google.com/protocol-buffers/docs/gotutorial
Protocol Buffer Basics: Python | Protocol Buffers | Google Developers
https://developers.google.com/protocol-buffers/docs/pythontutorial
Ways to serialize and retrieve structured data
How do you serialize and retrieve structured data like this? There are a few ways to solve this problem:
- Use gobs to serialize Go data structures. This is a good solution in a Go-specific environment, but it doesn't work well if you need to share data with applications written for other platforms.
- You can invent an ad-hoc way to encode the data items into a single string – such as encoding 4 ints as "12:3:-23:67". This is a simple and flexible approach, although it does require writing one-off encoding and parsing code, and the parsing imposes a small run-time cost. This works best for encoding very simple data.
- Serialize the data to XML. This approach can be very attractive since XML is (sort of) human readable and there are binding libraries for lots of languages. This can be a good choice if you want to share data with other applications/projects. However, XML is notoriously space intensive, and encoding/decoding it can impose a huge performance penalty on applications. Also, navigating an XML DOM tree is considerably more complicated than navigating simple fields in a class normally would be.
Protocol buffers are the flexible, efficient, automated solution to solve exactly this problem. With protocol buffers, you write a .proto description of the data structure you wish to store. From that, the protocol buffer compiler creates a class that implements automatic encoding and parsing of the protocol buffer data with an efficient binary format. The generated class provides getters and setters for the fields that make up a protocol buffer and takes care of the details of reading and writing the protocol buffer as a unit. Importantly, the protocol buffer format supports the idea of extending the format over time in such a way that the code can still read data encoded with the old format.
1/3. gobs (Go) / pickling (Python)
https://golang.org/pkg/encoding/gob/
Package gob manages streams of gobs - binary values exchanged between an Encoder (transmitter) and a Decoder (receiver). A typical use is transporting arguments and results of remote procedure calls (RPCs) such as those provided by package "net/rpc".
The implementation compiles a custom codec for each data type in the stream and is most efficient when a single Encoder is used to transmit a stream of values, amortizing the cost of compilation.
Poor cross-language support; limited to Go.
Use Python pickling. This is the default approach since it's built into the language, but it doesn't deal well with schema evolution, and also doesn't work very well if you need to share data with applications written in C++ or Java.
11.1. pickle — Python object serialization — Python 2.7.16 documentation
https://docs.python.org/2/library/pickle.html
The pickle module implements a fundamental, but powerful algorithm for serializing and de-serializing a Python object structure. “Pickling” is the process whereby a Python object hierarchy is converted into a byte stream, and “unpickling” is the inverse operation, whereby a byte stream is converted back into an object hierarchy. Pickling (and unpickling) is alternatively known as “serialization”, “marshalling,” [1] or “flattening”, however, to avoid confusion, the terms used here are “pickling” and “unpickling”.
This documentation describes both the pickle module and the cPickle module.
11.1. pickle — Python object serialization — Python 2.7.16 documentation
https://docs.python.org/2/library/pickle.html#module-cPickle
The cPickle module supports serialization and de-serialization of Python objects, providing an interface and functionality nearly identical to the pickle module. There are several differences, the most important being performance and subclassability.
First, cPickle can be up to 1000 times faster than pickle because the former is implemented in C. Second, in the cPickle module the callables Pickler() and Unpickler() are functions, not classes. This means that you cannot use them to derive custom pickling and unpickling subclasses. Most applications have no need for this functionality and should benefit from the greatly improved performance of the cPickle module.
The pickle data stream produced by pickle and cPickle are identical, so it is possible to use pickle and cPickle interchangeably with existing pickles. [10]
There are additional minor differences in API between cPickle and pickle, however for most applications, they are interchangeable. More documentation is provided in the pickle module documentation, which includes a list of the documented differences.
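A minimal sketch of pickling and unpickling in Python; in Python 3 the C-accelerated implementation is used automatically by the pickle module, so no separate cPickle import is needed:
import pickle

record={"name":"Xiao Wang","age":15,"scores":[90,85]}
# Object hierarchy -> byte stream
data=pickle.dumps(record)
# Byte stream -> object hierarchy
restored=pickle.loads(data)
print(restored)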
2/3. string
Plain strings have the drawback that they can only describe simple data structures.
3/3. XML
Good cross-language support, but high resource consumption and poor performance.
Smaller and faster
Developer Guide | Protocol Buffers | Google Developers
https://developers.google.com/protocol-buffers/docs/overview
Developer Guide
Welcome to the developer documentation for protocol buffers – a language-neutral, platform-neutral, extensible way of serializing structured data for use in communications protocols, data storage, and more.
This documentation is aimed at Java, C++, or Python developers who want to use protocol buffers in their applications. This overview introduces protocol buffers and tells you what you need to do to get started – you can then go on to follow the tutorials or delve deeper into protocol buffer encoding. API reference documentation is also provided for all three languages, as well as language and style guides for writing .proto files.
What are protocol buffers?
Protocol buffers are a flexible, efficient, automated mechanism for serializing structured data – think XML, but smaller, faster, and simpler. You define how you want your data to be structured once, then you can use special generated source code to easily write and read your structured data to and from a variety of data streams and using a variety of languages. You can even update your data structure without breaking deployed programs that are compiled against the "old" format.
How do they work?
You specify how you want the information you're serializing to be structured by defining protocol buffer message types in .proto files. Each protocol buffer message is a small logical record of information, containing a series of name-value pairs. Here's a very basic example of a .proto file that defines a message containing information about a person:
message Person {
  required string name = 1;
  required int32 id = 2;
  optional string email = 3;

  enum PhoneType {
    MOBILE = 0;
    HOME = 1;
    WORK = 2;
  }

  message PhoneNumber {
    required string number = 1;
    optional PhoneType type = 2 [default = HOME];
  }

  repeated PhoneNumber phone = 4;
}
As you can see, the message format is simple – each message type has one or more uniquely numbered fields, and each field has a name and a value type, where value types can be numbers (integer or floating-point), booleans, strings, raw bytes, or even (as in the example above) other protocol buffer message types, allowing you to structure your data hierarchically. You can specify optional fields, required fields, and repeated fields. You can find more information about writing .proto files in the Protocol Buffer Language Guide.
Once you've defined your messages, you run the protocol buffer compiler for your application's language on your .proto file to generate data access classes. These provide simple accessors for each field (like name() and set_name()) as well as methods to serialize/parse the whole structure to/from raw bytes – so, for instance, if your chosen language is C++, running the compiler on the above example will generate a class called Person. You can then use this class in your application to populate, serialize, and retrieve Person protocol buffer messages. You might then write some code like this:
Person person;
person.set_name("John Doe");
person.set_id(1234);
person.set_email("jdoe@example.com");
fstream output("myfile", ios::out | ios::binary);
person.SerializeToOstream(&output);
Then, later on, you could read your message back in:
fstream input("myfile", ios::in | ios::binary);
Person person;
person.ParseFromIstream(&input);
cout << "Name: " << person.name() << endl;
cout << "E-mail: " << person.email() << endl;
You can add new fields to your message formats without breaking backwards-compatibility; old binaries simply ignore the new field when parsing. So if you have a communications protocol that uses protocol buffers as its data format, you can extend your protocol without having to worry about breaking existing code.
You'll find a complete reference for using generated protocol buffer code in the API Reference section, and you can find out more about how protocol buffer messages are encoded in Protocol Buffer Encoding.
Why not just use XML?
Protocol buffers have many advantages over XML for serializing structured data. Protocol buffers:
- are simpler
- are 3 to 10 times smaller
- are 20 to 100 times faster
- are less ambiguous
- generate data access classes that are easier to use programmatically
For example, let's say you want to model a person with a name and an email. In XML, you need to do:
<person>
  <name>John Doe</name>
  <email>jdoe@example.com</email>
</person>
while the corresponding protocol buffer message (in protocol buffer text format) is:
# Textual representation of a protocol buffer.
# This is *not* the binary format used on the wire.
person {
  name: "John Doe"
  email: "jdoe@example.com"
}
When this message is encoded to the protocol buffer binary format (the text format above is just a convenient human-readable representation for debugging and editing), it would probably be 28 bytes long and take around 100-200 nanoseconds to parse. The XML version is at least 69 bytes if you remove whitespace, and would take around 5,000-10,000 nanoseconds to parse.
Also, manipulating a protocol buffer is much easier:
cout << "Name: " << person.name() << endl;
cout << "E-mail: " << person.email() << endl;
Whereas with XML you would have to do something like:
cout << "Name: "
<< person.getElementsByTagName("name")->item(0)->innerText()
<< endl;
cout << "E-mail: "
<< person.getElementsByTagName("email")->item(0)->innerText()
<< endl;
However, protocol buffers are not always a better solution than XML – for instance, protocol buffers would not be a good way to model a text-based document with markup (e.g. HTML), since you cannot easily interleave structure with text. In addition, XML is human-readable and human-editable; protocol buffers, at least in their native format, are not. XML is also – to some extent – self-describing. A protocol buffer is only meaningful if you have the message definition (the .proto file).
The compiler
binary format <--- the text format
https://github.com/protocolbuffers/protobuf/releases/tag/v3.7.1
Protocol Buffers source code is hosted on GitHub: https://github.com/protocolbuffers/protobuf.
Our old Google Code repository is: https://code.google.com/p/protobuf/. We moved to GitHub on Aug 26, 2014 and no future changes will be made on the Google Code site. For latest code updates/issues, please visit our GitHub site.
Compiling Your Protocol Buffers
https://developers.google.com/protocol-buffers/docs/pythontutorial
protoc -I=$SRC_DIR --python_out=$DST_DIR $SRC_DIR/addressbook.proto
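After compilation, the generated module (named after the .proto file, e.g. addressbook_pb2 for the tutorial's addressbook.proto) can be imported and used directly from Python. A rough sketch:
import addressbook_pb2

person = addressbook_pb2.Person()
person.id = 1234
person.name = "John Doe"
person.email = "jdoe@example.com"

# Serialize to the binary wire format and parse it back
data = person.SerializeToString()
restored = addressbook_pb2.Person()
restored.ParseFromString(data)
print(restored.name)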
https://mp.weixin.qq.com/s/eijIRBx-vKln0AftMqXkmQ