Handling binary files in Python (bytes and bits)
1. To work at byte granularity, use struct
https://docs.python.org/2/library/struct.html
By default, C types are represented in the machine’s native format and byte order, and properly aligned by skipping pad bytes if necessary (according to the rules used by the C compiler).
Alternatively, the first character of the format string can be used to indicate the byte order, size and alignment of the packed data, according to the following table:
| Character | Byte order | Size | Alignment |
|---|---|---|---|
| @ | native | native | native |
| = | native | standard | none |
| < | little-endian | standard | none |
| > | big-endian | standard | none |
| ! | network (= big-endian) | standard | none |
If the first character is not one of these, '@' is assumed.
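To make the prefix table concrete, here is a minimal sketch, assuming Python 3 (where packed data is a bytes object rather than a str). It packs the same integer under each explicit byte-order prefix; the value 1 is arbitrary and chosen only for illustration.

```python
import struct

# Pack the unsigned 32-bit integer 1 with each explicit byte-order prefix.
little = struct.pack('<I', 1)
big = struct.pack('>I', 1)
network = struct.pack('!I', 1)

print(little)          # b'\x01\x00\x00\x00'
print(big)             # b'\x00\x00\x00\x01'
print(network == big)  # True: network order is big-endian

# With a standard-size prefix there is no padding: 1 byte + 4 bytes.
print(struct.calcsize('<ci'))  # 5

# In native mode ('@', the default) the result is platform-dependent,
# because pad bytes may be inserted to align the int.
print(struct.calcsize('@ci'))
```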
Format characters have the following meaning; the conversion between C and Python values should be obvious given their types. The ‘Standard size’ column refers to the size of the packed value in bytes when using standard size; that is, when the format string starts with one of '<', '>', '!' or '='. When using native size, the size of the packed value is platform-dependent.
| Format | C Type | Python type | Standard size | Notes |
|---|---|---|---|---|
| x | pad byte | no value | | |
| c | char | string of length 1 | 1 | |
| b | signed char | integer | 1 | (3) |
| B | unsigned char | integer | 1 | (3) |
| ? | _Bool | bool | 1 | (1) |
| h | short | integer | 2 | (3) |
| H | unsigned short | integer | 2 | (3) |
| i | int | integer | 4 | (3) |
| I | unsigned int | integer | 4 | (3) |
| l | long | integer | 4 | (3) |
| L | unsigned long | integer | 4 | (3) |
| q | long long | integer | 8 | (2), (3) |
| Q | unsigned long long | integer | 8 | (2), (3) |
| f | float | float | 4 | (4) |
| d | double | float | 8 | (4) |
| s | char[] | string | | |
| p | char[] | string | | |
| P | void * | integer | | (5), (3) |
Notes:
1. The '?' conversion code corresponds to the _Bool type defined by C99. If this type is not available, it is simulated using a char. In standard mode, it is always represented by one byte. New in version 2.6.
2. The 'q' and 'Q' conversion codes are available in native mode only if the platform C compiler supports C long long, or, on Windows, __int64. They are always available in standard modes. New in version 2.2.
3. When attempting to pack a non-integer using any of the integer conversion codes, if the non-integer has a __index__() method then that method is called to convert the argument to an integer before packing. If no __index__() method exists, or the call to __index__() raises TypeError, then the __int__() method is tried. However, the use of __int__() is deprecated, and will raise DeprecationWarning. Changed in version 2.7: Use of the __index__() method for non-integers is new in 2.7. Changed in version 2.7: Prior to version 2.7, not all integer conversion codes would use the __int__() method to convert, and DeprecationWarning was raised only for float arguments.
4. For the 'f' and 'd' conversion codes, the packed representation uses the IEEE 754 binary32 (for 'f') or binary64 (for 'd') format, regardless of the floating-point format used by the platform.
5. The 'P' format character is only available for the native byte ordering (selected as the default or with the '@' byte order character). The byte order character '=' chooses to use little- or big-endian ordering based on the host system. The struct module does not interpret this as native ordering, so the 'P' format is not available.
A format character may be preceded by an integral repeat count. For example, the format string '4h' means exactly the same as 'hhhh'.
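The repeat-count rule can be checked directly. The following is a small sketch, assuming Python 3; the sample values are arbitrary. It also shows that for 's' the number is a byte length rather than a repeat count.

```python
import struct

# A repeat count expands to that many copies of the format character.
with_count = struct.pack('<4h', 1, 2, 3, 4)
spelled_out = struct.pack('<hhhh', 1, 2, 3, 4)
print(with_count == spelled_out)  # True

values = struct.unpack('<4h', with_count)
print(values)  # (1, 2, 3, 4)

# For 's' the number is a byte length, not a repeat count:
# '4s' is one 4-byte string, so unpack returns a single item.
(tag,) = struct.unpack('4s', b'ABCD')
print(tag)  # b'ABCD'
```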
Example 1:
Suppose we have the following C struct:
struct Header
{
    unsigned short id;
    char tag[4];
    unsigned int version;
    unsigned int count;
};
Suppose one such struct has been received through socket.recv and stored in the string s. It can be parsed with the unpack() function:
import struct
id, tag, version, count = struct.unpack("!H4s2I", s)
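In the format string above, ! selects network byte order, which is appropriate because the data was received from the network and travels in network byte order. H is the unsigned short id, 4s is a 4-byte string, and 2I is two unsigned int values.
After this single unpack call, id, tag, version, and count hold the parsed values.
Likewise, local values can easily be packed back into the struct layout:
ss = struct.pack("!H4s2I", id, tag, version, count)
pack converts id, tag, version, and count into the Header layout according to the given format. ss is now a string (in effect a byte stream laid out like the C struct) and can be sent with socket.send(ss).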
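The pack/unpack round trip for Header can be sketched as follows, assuming Python 3. The field values are hypothetical stand-ins for data that would normally arrive from a socket, and the names HEADER_FMT and header_id are introduced here only for illustration.

```python
import struct

HEADER_FMT = "!H4s2I"  # network byte order, standard sizes, no padding

# Hypothetical field values standing in for data received from a socket.
packed = struct.pack(HEADER_FMT, 7, b"DATA", 1, 42)
print(len(packed))  # 14 = 2 + 4 + 4 + 4

header_id, tag, version, count = struct.unpack(HEADER_FMT, packed)
print(header_id, tag, version, count)  # 7 b'DATA' 1 42
```

Note that a C compiler will usually pad the native struct, so its in-memory size can be larger than these 14 bytes. The format only matches what is on the wire if the sender serialized the fields one by one without padding.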
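Example 2:
import struct
a = 12.34
# Convert a to its binary form ('d' is an 8-byte double; 'i' only accepts integers)
data = struct.pack('d', a)
data is now a string whose bytes match the binary representation of a.
Now the reverse operation.
Given the binary data in data (really just a string), convert it back into a Python value:
a, = struct.unpack('d', data)
Note that unpack always returns a tuple.
So if only one value was packed:
data = struct.pack('d', a)
then decoding has to be written like this:
a, = struct.unpack('d', data)    or    (a,) = struct.unpack('d', data)
Writing a = struct.unpack('d', data) directly gives a = (12.34,), a tuple rather than the original float.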
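Here is a minimal sketch, assuming Python 3, of how unpack returns a tuple and how the choice of format code affects a float. The variable names restored, as_tuple, and single are illustrative only.

```python
import struct

a = 12.34

# 'd' (8-byte double) round-trips a Python float exactly.
(restored,) = struct.unpack('d', struct.pack('d', a))
print(restored == a)  # True

# Without the trailing comma you get the 1-tuple itself.
as_tuple = struct.unpack('d', struct.pack('d', a))
print(as_tuple)  # (12.34,)

# 'f' (4-byte float) has less precision, so the value changes slightly.
(single,) = struct.unpack('f', struct.pack('f', a))
print(single == a)  # False

# In Python 3, an integer code such as 'i' rejects a float outright.
try:
    struct.pack('i', a)
    rejected = False
except struct.error:
    rejected = True
print(rejected)  # True
```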
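When several values are packed together, it looks like this:
a = b'hello'
b = b'world!'
c = 2
d = 45.123
data = struct.pack('5s6sif', a, b, c, d)
data now holds the values in binary form and can be written straight to a file, for example binfile.write(data).
When needed, it can be read back with data = binfile.read()
and decoded into Python variables with struct.unpack():
a, b, c, d = struct.unpack('5s6sif', data)
'5s6sif' is the fmt, that is, the format string. It is made up of optional counts followed by format characters: 5s is a 5-byte string, 2i would be two integers, and so on. The available characters are listed in the format table above, where the C Type column shows which C type corresponds to each Python type.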
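The multi-field case can be sketched as below, assuming Python 3 (so the string fields are bytes). The '<' prefix is an addition for this sketch, not part of the original format; it fixes the layout so the packed size is the same on every platform.

```python
import struct

record = (b'hello', b'world!', 2, 45.123)

# Standard sizes, little-endian, no padding: 5 + 6 + 4 + 4 bytes.
packed = struct.pack('<5s6sif', *record)
print(len(packed))  # 19

a, b, c, d = struct.unpack('<5s6sif', packed)
print(a, b, c)                 # b'hello' b'world!' 2
print(abs(d - 45.123) < 1e-5)  # True: 'f' keeps roughly 7 significant digits

# Native mode may insert pad bytes before 'i', so the size can differ.
print(struct.calcsize('5s6sif'))
```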
Example 3:
import struct
# Read a little-endian short followed by a little-endian float from a file
with open(file_name, "rb") as f:
    short_data = struct.unpack('<h', f.read(2))[0]
    float_data = struct.unpack('<f', f.read(4))[0]
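To try the file-reading pattern end to end, the following sketch (assuming Python 3) first writes a small sample file and then reads it back. The file name sample.bin and the values -123 and 1.5 are made up for the demonstration.

```python
import os
import struct
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.bin")  # hypothetical file name

    # Write one little-endian short, then one little-endian float.
    with open(path, "wb") as f:
        f.write(struct.pack('<h', -123))
        f.write(struct.pack('<f', 1.5))

    # Read the fields back in the same order and with the same formats.
    with open(path, "rb") as f:
        short_data = struct.unpack('<h', f.read(2))[0]
        float_data = struct.unpack('<f', f.read(4))[0]

print(short_data, float_data)  # -123 1.5
```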
2. Some protocols define field widths in bits (3 bits wide, 7 bits wide, and so on). struct is not a good fit for these.
The bitstring package makes them fairly simple to handle:
https://pypi.org/project/bitstring/
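Code example:
import bitstring
with open(file_name, "rb") as f:
    file_b = bitstring.BitStream(bytes=f.read())
print(file_b.read(3).int)   # next 3 bits as a signed integer
print(file_b.read(3).int)
print(file_b.read(7).uint)  # next 7 bits as an unsigned integer; .bytes needs a multiple of 8 bits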
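For readers without bitstring installed, the same idea can be sketched with plain integer shifts from the standard library. This is a substitute technique, not how bitstring works internally; read_bits is a hypothetical helper, the two sample bytes are arbitrary, and it returns unsigned values only (unlike the signed .int above).

```python
def read_bits(data, bit_offset, width):
    """Return `width` bits starting at `bit_offset` (MSB first) as an unsigned int."""
    total_bits = len(data) * 8
    if bit_offset + width > total_bits:
        raise ValueError("not enough bits")
    value = int.from_bytes(data, "big")
    shift = total_bits - bit_offset - width
    return (value >> shift) & ((1 << width) - 1)

data = bytes([0b10101100, 0b11010000])  # arbitrary sample input

first = read_bits(data, 0, 3)   # bits 101
second = read_bits(data, 3, 3)  # bits 011
third = read_bits(data, 6, 7)   # bits 0011010, spanning both bytes

print(first, second, third)  # 5 3 26
```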
You can also describe a whole structure with a format string:
fmt = 'sequence_header_code, uint:12=horizontal_size_value, uint:12=vertical_size_value, uint:4=aspect_ratio_information, ... '
d = {'sequence_header_code': '0x000001b3', 'horizontal_size_value': 352, 'vertical_size_value': 288, 'aspect_ratio_information': 1, ... }
s = bitstring.pack(fmt, **d)