OFRecord Data Format


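Deep learning applications need complex, multi-stage data preprocessing pipelines, and data loading is the first stage of such a pipeline. OneFlow can load data in several formats; among them, OFRecord is OneFlow's native data format.

The OFRecord format is modeled on TensorFlow's TFRecord, so users who already know TFRecord can get started with OFRecord quickly.

This article covers:

The data types used by OFRecord
How to convert data into OFRecord objects and serialize them
The OFRecord file format

It is useful background for the article "Loading and Preparing OFRecord Datasets".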

OFRecord-Related Data Types

Internally, OneFlow uses Protocol Buffers to describe the serialization format of OFRecord. The .proto file is oneflow/core/record/record.proto, and its definitions are as follows:

syntax = "proto2";

package oneflow;

message BytesList {
  repeated bytes value = 1;
}

message FloatList {
  repeated float value = 1 [packed = true];
}

message DoubleList {
  repeated double value = 1 [packed = true];
}

message Int32List {
  repeated int32 value = 1 [packed = true];
}

message Int64List {
  repeated int64 value = 1 [packed = true];
}

message Feature {
  oneof kind {
    BytesList bytes_list = 1;
    FloatList float_list = 2;
    DoubleList double_list = 3;
    Int32List int32_list = 4;
    Int64List int64_list = 5;
  }
}

message OFRecord {
  map<string, Feature> feature = 1;
}

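The important data types above are explained first:

OFRecord: an OFRecord object can hold all the data that needs to be serialized. It consists of any number of string->Feature key-value pairs;
Feature: a Feature can store any one of the BytesList, FloatList, DoubleList, Int32List, and Int64List types;
For OFRecord, Feature, and the XXXList types, Protocol Buffers generates interfaces with the same names, so the corresponding objects can be constructed in Python.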
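
To make the shape of these types concrete, the sketch below models an OFRecord with plain Python containers only. It does not use the protobuf-generated classes; FEATURE_KINDS, make_feature, and kind_of are hypothetical names introduced purely for illustration. It shows the two structural rules described above: a record maps string keys to features, and each feature holds one of the five list kinds. The model requires exactly one kind per feature, which is an assumption of this sketch.

```python
# Plain-Python model of the OFRecord structure (illustration only; the real
# classes are generated by Protocol Buffers from record.proto).
FEATURE_KINDS = ("bytes_list", "float_list", "double_list", "int32_list", "int64_list")


def make_feature(kind, values):
    """Return a one-entry dict standing in for a Feature with one list kind set."""
    if kind not in FEATURE_KINDS:
        raise ValueError("unknown feature kind: %r" % (kind,))
    return {kind: list(values)}


def kind_of(feature):
    """Return the single list kind stored in a modelled Feature."""
    if len(feature) != 1:
        raise ValueError("a modelled Feature holds exactly one list kind")
    (kind,) = feature.keys()
    return kind


# A modelled OFRecord: any number of string -> Feature pairs.
record = {
    "images": make_feature("float_list", [0.0, 0.5, 1.0]),
    "labels": make_feature("int64_list", [7]),
}

for name, feature in record.items():
    kind = kind_of(feature)
    print(name, kind, feature[kind])
```

The real generated classes enforce the same shape through the oneof and map declarations in record.proto, so this model is only a reading aid.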

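Converting Data to the Feature Format

Data can be converted to the Feature format by calling ofrecord.XXXList and ofrecord.Feature. For convenience, the interfaces generated by Protocol Buffers are wrapped in a few simple helper functions: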

import six
import oneflow.core.record.record_pb2 as ofrecord


def int32_feature(value):
    # Wrap a single value in a list so that scalars are accepted too.
    if not isinstance(value, (list, tuple)):
        value = [value]
    return ofrecord.Feature(int32_list=ofrecord.Int32List(value=value))


def int64_feature(value):
    if not isinstance(value, (list, tuple)):
        value = [value]
    return ofrecord.Feature(int64_list=ofrecord.Int64List(value=value))


def float_feature(value):
    if not isinstance(value, (list, tuple)):
        value = [value]
    return ofrecord.Feature(float_list=ofrecord.FloatList(value=value))


def double_feature(value):
    if not isinstance(value, (list, tuple)):
        value = [value]
    return ofrecord.Feature(double_list=ofrecord.DoubleList(value=value))


def bytes_feature(value):
    if not isinstance(value, (list, tuple)):
        value = [value]
    # On Python 3, encode str items to bytes (decided by the first item).
    if not six.PY2:
        if isinstance(value[0], str):
            value = [x.encode() for x in value]
    return ofrecord.Feature(bytes_list=ofrecord.BytesList(value=value))

Creating and Serializing an OFRecord Object

The following example creates OFRecord objects with two features each and serializes them by calling the SerializeToString method.

import random

observations = 28 * 28

# The ./dataset directory must already exist.
f = open("./dataset/part-0", "wb")

for loop in range(0, 3):
    # One fake flattened 28*28 image and one label per sample.
    image = [random.random() for x in range(0, observations)]
    label = [random.randint(0, 9)]

    topack = {
        "images": float_feature(image),
        "labels": int64_feature(label),
    }

    ofrecord_features = ofrecord.OFRecord(feature=topack)
    serilizedBytes = ofrecord_features.SerializeToString()

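The example above suggests the following steps for serializing data:

Convert the data to be serialized into Feature objects by calling ofrecord.Feature and ofrecord.XXXList;
Store the Feature objects from the previous step in a Python dictionary as string->Feature key-value pairs;
Create an OFRecord object by calling ofrecord.OFRecord;
Call the SerializeToString method of the OFRecord object to obtain the serialized result.

The serialized result can then be saved to a file in the OFRecord format.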

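OFRecord Files

Serializing OFRecord objects and storing them in the format agreed on by OneFlow produces an OFRecord file.

A single OFRecord file can store multiple OFRecord objects. OFRecord files can be used in OneFlow data pipelines; see "Loading and Preparing OFRecord Datasets" for the details.

By OneFlow's convention, each OFRecord object is stored in the following format:

uint64 length

byte data[length]

That is, the first 8 bytes store the length of the data, followed by the serialized data itself.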

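import struct

# Runs inside the loop above, once per OFRecord object.
length = ofrecord_features.ByteSize()

# "q" packs a C long long (8 bytes on common platforms) in native byte order.
f.write(struct.pack("q", length))
f.write(serilizedBytes)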
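
The framing rule can be tried without OneFlow installed. The sketch below is a minimal illustration, assuming arbitrary byte strings stand in for serialized OFRecord objects; write_record is a hypothetical helper name, not part of OneFlow. It uses the same struct.pack("q", ...) call as the snippet above and writes to an in-memory buffer instead of a file.

```python
import io
import struct


def write_record(stream, payload):
    """Write one length-prefixed record: 8-byte length header, then the payload."""
    stream.write(struct.pack("q", len(payload)))  # native byte order
    stream.write(payload)


# Placeholder payloads; in real use these would be the bytes returned by
# the SerializeToString method of an OFRecord object.
payloads = [b"first", b"second record", b""]

buffer = io.BytesIO()
for payload in payloads:
    write_record(buffer, payload)

raw = buffer.getvalue()

# Each record costs one header plus its payload length.
header_size = struct.calcsize("q")
expected_size = sum(header_size + len(p) for p in payloads)
print(len(raw), expected_size)
```

Because records are simply concatenated, appending more samples to a file never requires rewriting what is already there.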

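Code

The complete code below shows how to generate an OFRecord file, and how to read the data in an OFRecord file by hand using the OFRecord interface generated by protobuf.

In practice, OneFlow provides interfaces such as flow.data.decode_ofrecord that make it easier to extract the contents of OFRecord files (datasets). See "Loading and Preparing OFRecord Datasets" for details.

Writing OFRecord Objects to a File

The following script simulates three samples, each a 28*28 image with a corresponding label. After the three samples are converted to OFRecord objects, they are written to a file in the format agreed on by OneFlow.

Code: ofrecord_to_string.py

Reading Data from an OFRecord File

The following script reads the OFRecord file generated in the previous example, calls the FromString method to deserialize it into OFRecord objects, and finally displays the data:

Code: ofrecord_from_string.py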
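
As a rough sketch of the reading side, the generator below walks a stream of length-prefixed records and yields each payload as raw bytes. iter_records is a hypothetical helper, and the sample bytes are placeholders rather than real serialized OFRecord objects. With OneFlow installed, each payload would instead be passed to the FromString method mentioned above.

```python
import io
import struct

HEADER_SIZE = struct.calcsize("q")  # 8 bytes on common platforms


def iter_records(stream):
    """Yield the payload of each length-prefixed record until the stream ends."""
    while True:
        header = stream.read(HEADER_SIZE)
        if not header:
            return  # clean end of stream
        if len(header) < HEADER_SIZE:
            raise ValueError("truncated length header")
        (length,) = struct.unpack("q", header)
        if length < 0:
            raise ValueError("negative record length")
        payload = stream.read(length)
        if len(payload) < length:
            raise ValueError("truncated record payload")
        yield payload


# Build a small in-memory stream holding two placeholder payloads.
sample = b"".join(struct.pack("q", len(p)) + p for p in (b"abc", b"defgh"))

records = list(iter_records(io.BytesIO(sample)))
print(records)
```

Raising on a short header or payload, rather than silently stopping, makes a partially written file easier to notice.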

posted @ 2021-02-16 05:52 吴建明wujianming
