Caffe2 创建你的专属数据集(Create Your Own Dataset)[9]
这一节尝试把你的数据转换成caffe2能够使用的形式。这个教程使用Iris的数据集。你可以点击这里查看Ipython Notebook教程。
DB数据格式
Caffe2使用二进制的DB格式来保存数据。Caffe2 DB其实是键-值存储方式的一个美名而已。在键-值(key-value)存储方式里,键是随机生成的,所以batches是独立同分布的。而值(Value)则是真正的数据,他们包含着训练过程中真正用到的数据。所以,DB中保存的数据格式就像下面这样:
key1 value1 key2 value2 key3 value3 ...
在DB中,他把keys和values看成strings。你可以用TensorProtos protobuf来将你要保存的东西保存成DB数据结构。一个TensorProtos protobuf封装了Tensor(多维矩阵),和它的数据类型,形状信息。然后,你可以通过TensorProtosDBInput操作来载入数据到SGD训练过程中。
准备自己的数据
这里,我们向你展示如何创建自己的数据集。为此,我们将会使用UCI Iris数据集。这是一个非常受欢迎的经典的用于分类鸢尾花的数据集。它包含4个代表花的外形特征的实数。这个数据集包含3种鸢尾花。你可以从这里下载数据集。
%matplotlib inline
import urllib2 # 用于从网上下载数据集
import numpy as np
from matplotlib import pyplot
from StringIO import StringIO
from caffe2.python import core, utils, workspace
from caffe2.proto import caffe2_pb2
WARNING:root:This caffe2 python run does not have GPU support. Will run in CPU only mode.
WARNING:root:Debug message: No module named caffe2_pybind11_state_gpu
#如果你在Mac OS下使用homebrew,你可能会遇到一个错误: malloc_zone_unregister() 函数失败.这不是Caffe2的问题,而是因为 homebrew leveldb 的内存分配不兼容. 但这不影响使用。
f = urllib2.urlopen('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data')
raw_data = f.read()
print('Raw data looks like this:')
print(raw_data[:100] + '...')
输出:
Raw data looks like this:
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,...
#将特征保存到一个特征矩阵
features = np.loadtxt(StringIO(raw_data), dtype=np.float32, delimiter=',', usecols=(0, 1, 2, 3))
#把label存到一个特征矩阵中
label_converter = lambda s : {'Iris-setosa':0, 'Iris-versicolor':1, 'Iris-virginica':2}[s]
labels = np.loadtxt(StringIO(raw_data), dtype=np.int, delimiter=',', usecols=(4,), converters={4: label_converter})
在我们开始训练之前,最好将数据集分成训练集和测试集。在这个例子中,让我们随机打乱数据,用前100个数据做训练集,剩余50个数据做测试。当然你也可以用更加复杂的方式,例如使用交叉校验的方式将数据集分成多个训练集和测试集。关于交叉校验的更多信息,请看这里。
random_index = np.random.permutation(150)
features = features[random_index]
labels = labels[random_index]
train_features = features[:100]
train_labels = labels[:100]
test_features = features[100:]
test_labels = labels[100:]
legend = ['rx', 'b+', 'go']
pyplot.title("Training data distribution, feature 0 and 1")
for i in range(3):
pyplot.plot(train_features[train_labels==i, 0], train_features[train_labels==i, 1], legend[i])
pyplot.figure()
pyplot.title("Testing data distribution, feature 0 and 1")
for i in range(3):
pyplot.plot(test_features[test_labels==i, 0], test_features[test_labels==i, 1], legend[i])
现在,把数据放进Caffe2的DB中去。在这个DB中,我们将会使用train_xxx
作为key,并对于每一个点使用一个TensorProtos对象去储存,一个TensorProtos包含两个tensor:一个是特征,一个是label。我们使用Caffe2的Python DB接口。
# 构建一个TensorProtos protobuf
feature_and_label = caffe2_pb2.TensorProtos()
feature_and_label.protos.extend([
utils.NumpyArrayToCaffe2Tensor(features[0]),
utils.NumpyArrayToCaffe2Tensor(labels[0])])
print('This is what the tensor proto looks like for a feature and its label:')
print(str(feature_and_label))
print('This is the compact string that gets written into the db:')
print(feature_and_label.SerializeToString())
This is what the tensor proto looks like for a feature and its label:
protos {
dims: 4
data_type: FLOAT
float_data: 5.40000009537
float_data: 3.0
float_data: 4.5
float_data: 1.5
}
protos {
data_type: INT32
int32_data: 1
}
This is the compact string that gets written into the db:
�̬@@@�@�?
"
现在真正写入DB中去
def write_db(db_type, db_name, features, labels):
db = core.C.create_db(db_type, db_name, core.C.Mode.write)
transaction = db.new_transaction()
for i in range(features.shape[0]):
feature_and_label = caffe2_pb2.TensorProtos()
feature_and_label.protos.extend([
utils.NumpyArrayToCaffe2Tensor(features[i]),
utils.NumpyArrayToCaffe2Tensor(labels[i])])
transaction.put(
'train_%03d'.format(i),
feature_and_label.SerializeToString())
# Close the transaction, and then close the db.
del transaction
del db
write_db("minidb", "iris_train.minidb", train_features, train_labels)
write_db("minidb", "iris_test.minidb", test_features, test_labels)
现在让我恩创建一个简单的网络,这个网络只包含一个简单的TensorProtosDBInput 操作,用来展示我们如何从创建好的DB中读入数据。
net_proto = core.Net("example_reader")
dbreader = net_proto.CreateDB([], "dbreader", db="iris_train.minidb", db_type="minidb")
net_proto.TensorProtosDBInput([dbreader], ["X", "Y"], batch_size=16)
print("The net looks like this:")
print(str(net_proto.Proto()))
The net looks like this:
name: "example_reader"
op {
output: "dbreader"
name: ""
type: "CreateDB"
arg {
name: "db_type"
s: "minidb"
}
arg {
name: "db"
s: "iris_train.minidb"
}
}
op {
input: "dbreader"
output: "X"
output: "Y"
name: ""
type: "TensorProtosDBInput"
arg {
name: "batch_size"
i: 16
}
}
创建网络
workspace.CreateNet(net_proto)
# 先跑一次,然后获取里面的数据
workspace.RunNet(net_proto.Proto().name)
print("The first batch of feature is:")
print(workspace.FetchBlob("X"))
print("The first batch of label is:")
print(workspace.FetchBlob("Y"))
# 再跑一次
workspace.RunNet(net_proto.Proto().name)
print("The second batch of feature is:")
print(workspace.FetchBlob("X"))
print("The second batch of label is:")
print(workspace.FetchBlob("Y"))
The first batch of feature is:
[[ 5.19999981 4.0999999 1.5 0.1 ]
[ 5.0999999 3.79999995 1.5 0.30000001]
[ 6.9000001 3.0999999 4.9000001 1.5 ]
[ 7.69999981 2.79999995 6.69999981 2. ]
[ 6.5999999 2.9000001 4.5999999 1.29999995]
[ 6.30000019 2.79999995 5.0999999 1.5 ]
[ 7.30000019 2.9000001 6.30000019 1.79999995]
[ 5.5999999 2.9000001 3.5999999 1.29999995]
[ 6.5 3. 5.19999981 2. ]
[ 5. 3.4000001 1.5 0.2 ]
[ 6.9000001 3.0999999 5.4000001 2.0999999 ]
[ 6. 3.4000001 4.5 1.60000002]
[ 5.4000001 3.4000001 1.70000005 0.2 ]
[ 6.30000019 2.70000005 4.9000001 1.79999995]
[ 5.19999981 2.70000005 3.9000001 1.39999998]
[ 6.19999981 2.9000001 4.30000019 1.29999995]]
The first batch of label is:
[0 0 1 2 1 2 2 1 2 0 2 1 0 2 1 1]
The second batch of feature is:
[[ 5.69999981 2.79999995 4.0999999 1.29999995]
[ 5.0999999 2.5 3. 1.10000002]
[ 4.4000001 2.9000001 1.39999998 0.2 ]
[ 7. 3.20000005 4.69999981 1.39999998]
[ 5.69999981 2.9000001 4.19999981 1.29999995]
[ 5. 3.5999999 1.39999998 0.2 ]
[ 5.19999981 3.5 1.5 0.2 ]
[ 6.69999981 3. 5.19999981 2.29999995]
[ 6.19999981 3.4000001 5.4000001 2.29999995]
[ 6.4000001 2.70000005 5.30000019 1.89999998]
[ 6.5 3.20000005 5.0999999 2. ]
[ 6.0999999 3. 4.9000001 1.79999995]
[ 5.4000001 3.4000001 1.5 0.40000001]
[ 4.9000001 3.0999999 1.5 0.1 ]
[ 5.5 3.5 1.29999995 0.2 ]
[ 6.69999981 3. 5. 1.70000005]]
The second batch of label is:
[1 1 0 1 1 0 0 2 2 2 2 2 0 0 0 1]
至此,本节教程结束。
转载请注明出处:http://www.jianshu.com/c/cf07b31bb5f2