Caffe2: Create Your Own Dataset [9]

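This section shows how to convert your data into a form that Caffe2 can use. The tutorial uses the Iris dataset. You can click here to view the IPython Notebook tutorial.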

The DB data format

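Caffe2 stores data in a binary DB format. A Caffe2 DB is really just a glorified name for a key-value store. The keys are usually randomized, so that batches are approximately independent and identically distributed (i.i.d.). The values are the real content: they hold the serialized data that training actually consumes. So the data stored in a DB looks like this: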

key1 value1 key2 value2 key3 value3 ...
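
To make the layout concrete, here is a minimal pure-Python sketch of such a store. It does not use Caffe2: `build_store` and the `key_NNN` naming are hypothetical, and plain byte strings stand in for the serialized values. Keys are assigned after shuffling, so reading in key order visits the records in random order.

```python
import random

# Stand-ins for serialized values; in a real Caffe2 DB these would be
# serialized TensorProtos strings.
records = [("sample-%d" % i).encode("ascii") for i in range(6)]


def build_store(values, seed=0):
    """Return (key, value) pairs whose key order is a shuffle of the input."""
    order = list(range(len(values)))
    random.Random(seed).shuffle(order)
    # The key encodes the new position, the value is the original record.
    return [("key_%03d" % pos, values[old]) for pos, old in enumerate(order)]


store = build_store(records)
for key, value in store:
    print(key, value)
```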

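The DB treats both keys and values as strings. You can use the TensorProtos protobuf to serialize whatever you want to store in the DB. A TensorProtos protobuf wraps tensors (multi-dimensional arrays) together with their data type and shape information. You can then load the data into an SGD training run with the TensorProtosDBInput operator.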
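
The idea of "a tensor plus its shape, flattened into a string" can be illustrated without protobuf. The sketch below is not the real TensorProtos wire format: `pack_tensor` and `unpack_tensor` are hypothetical helpers that handle float32 data only and use a made-up byte layout (number of dimensions, the dimensions, then the values).

```python
import struct


def _count(dims):
    """Number of elements implied by a shape."""
    total = 1
    for d in dims:
        total *= d
    return total


def pack_tensor(values, dims):
    """Serialize float values and their shape into one byte string."""
    if _count(dims) != len(values):
        raise ValueError("dims do not match the number of values")
    header = struct.pack("<I", len(dims)) + struct.pack("<%dI" % len(dims), *dims)
    body = struct.pack("<%df" % len(values), *values)
    return header + body


def unpack_tensor(blob):
    """Recover (values, dims) from a byte string made by pack_tensor."""
    (ndim,) = struct.unpack_from("<I", blob, 0)
    dims = list(struct.unpack_from("<%dI" % ndim, blob, 4))
    values = list(struct.unpack_from("<%df" % _count(dims), blob, 4 + 4 * ndim))
    return values, dims


blob = pack_tensor([5.1, 3.5, 1.4, 0.2], [4])
values, dims = unpack_tensor(blob)
print(len(blob), dims, values)
```

The real TensorProtos message also records the data type, which is why the feature and the label in this tutorial can live side by side as FLOAT and INT32 tensors.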

Preparing your own data

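Here we show how to create your own dataset, using the UCI Iris dataset. It is a popular, classic dataset for classifying iris flowers. Each sample has 4 real-valued features describing the shape of the flower, and the dataset covers 3 iris species. You can download the dataset here.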

%matplotlib inline
import urllib2 # Python 2 module, used to download the dataset from the web
import numpy as np
from matplotlib import pyplot
from StringIO import StringIO
from caffe2.python import core, utils, workspace
from caffe2.proto import caffe2_pb2
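# On a CPU-only build, the import above logs warnings such as:
#   WARNING:root:This caffe2 python run does not have GPU support. Will run in CPU only mode.
#   WARNING:root:Debug message: No module named caffe2_pybind11_state_gpu
# If you use homebrew on Mac OS, you may also see an error saying that
# malloc_zone_unregister() failed. This is not a Caffe2 problem: it comes from an
# incompatibility in homebrew leveldb's memory allocation, and it does not affect usage.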
f = urllib2.urlopen('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data')
raw_data = f.read()
print('Raw data looks like this:')
print(raw_data[:100] + '...')

Output:

Raw data looks like this:
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,...
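# Load the four feature columns into a feature matrix
features = np.loadtxt(StringIO(raw_data), dtype=np.float32, delimiter=',', usecols=(0, 1, 2, 3))
# Convert the species names to integer labels and load them into a label vector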
label_converter = lambda s : {'Iris-setosa':0, 'Iris-versicolor':1, 'Iris-virginica':2}[s]
labels = np.loadtxt(StringIO(raw_data), dtype=np.int, delimiter=',', usecols=(4,), converters={4: label_converter})
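
If you want to check the parsing logic without NumPy or a network connection, the same conversion can be sketched with the standard `csv` module. This is only an illustration: `parse_iris` is a hypothetical helper, and `SAMPLE` just repeats the first three rows printed above.

```python
import csv
import io

# The first three rows of iris.data, as printed earlier in this tutorial.
SAMPLE = """5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
"""

# Same species-to-integer mapping as label_converter.
LABELS = {'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 2}


def parse_iris(text):
    """Split CSV text into a list of 4-float feature rows and integer labels."""
    features, labels = [], []
    for row in csv.reader(io.StringIO(text)):
        if not row:  # skip blank lines such as a trailing newline
            continue
        features.append([float(x) for x in row[:4]])
        labels.append(LABELS[row[4]])
    return features, labels


sample_features, sample_labels = parse_iris(SAMPLE)
print(sample_features)
print(sample_labels)
```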

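Before we start training, it is a good idea to split the dataset into a training set and a test set. In this example, we shuffle the data randomly, use the first 100 samples for training and the remaining 50 for testing. You can of course use more sophisticated schemes, such as cross-validation, which splits the dataset into several train/test pairs. For more on cross-validation, see here.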

random_index = np.random.permutation(150)
features = features[random_index]
labels = labels[random_index]
train_features = features[:100]
train_labels = labels[:100]
test_features = features[100:]
test_labels = labels[100:]
legend = ['rx', 'b+', 'go']
pyplot.title("Training data distribution, feature 0 and 1")
for i in range(3):
    pyplot.plot(train_features[train_labels==i, 0], train_features[train_labels==i, 1], legend[i])
pyplot.figure()
pyplot.title("Testing data distribution, feature 0 and 1")
for i in range(3):
    pyplot.plot(test_features[test_labels==i, 0], test_features[test_labels==i, 1], legend[i])
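
For the cross-validation alternative mentioned above, a minimal k-fold split could look like the following. It is a sketch in plain Python, not part of Caffe2: `kfold_indices` is a hypothetical helper, and the choice of 5 folds is an assumption for illustration.

```python
import random


def kfold_indices(n_samples, n_folds, seed=0):
    """Yield (train_indices, test_indices) for each of n_folds folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_folds groups, round-robin.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        train = [idx for j, fold in enumerate(folds) if j != k for idx in fold]
        yield train, test


splits = list(kfold_indices(150, 5))
for train, test in splits:
    print(len(train), len(test))
```

Each index list can be used the same way as random_index above, for example features[train] and features[test] after converting the list to a NumPy index array.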


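Now let's put the data into a Caffe2 DB. In this DB, we use keys of the form train_xxx and store each data point as one TensorProtos object that contains two tensors: one for the features and one for the label. We use Caffe2's Python DB interface.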

# Build a TensorProtos protobuf
feature_and_label = caffe2_pb2.TensorProtos()
feature_and_label.protos.extend([
    utils.NumpyArrayToCaffe2Tensor(features[0]),
    utils.NumpyArrayToCaffe2Tensor(labels[0])])
print('This is what the tensor proto looks like for a feature and its label:')
print(str(feature_and_label))
print('This is the compact string that gets written into the db:')
print(feature_and_label.SerializeToString())

Output:

This is what the tensor proto looks like for a feature and its label:
protos {
  dims: 4
  data_type: FLOAT
  float_data: 5.40000009537
  float_data: 3.0
  float_data: 4.5
  float_data: 1.5
}
protos {
  data_type: INT32
  int32_data: 1
}
This is the compact string that gets written into the db:
(raw serialized protobuf bytes, not printable as text)

Now actually write the data into the DB.

def write_db(db_type, db_name, features, labels):
    db = core.C.create_db(db_type, db_name, core.C.Mode.write)
    transaction = db.new_transaction()
    for i in range(features.shape[0]):
        feature_and_label = caffe2_pb2.TensorProtos()
        feature_and_label.protos.extend([
            utils.NumpyArrayToCaffe2Tensor(features[i]),
            utils.NumpyArrayToCaffe2Tensor(labels[i])])
        transaction.put(
            'train_%03d' % i,
            feature_and_label.SerializeToString())
    # Close the transaction, and then close the db.
    del transaction
    del db

write_db("minidb", "iris_train.minidb", train_features, train_labels)
write_db("minidb", "iris_test.minidb", test_features, test_labels)
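
The record keys deserve a quick check, because Python's two string-formatting styles are easy to mix up. The snippet below is a small standalone illustration (the `record_key` helper is hypothetical): a `%03d` placeholder needs the `%` operator, the `str.format` equivalent is `{:03d}`, and calling `.format` on a `%`-style template leaves the placeholder untouched, which would give every record the same key.

```python
def record_key(prefix, index):
    """Build a zero-padded key such as 'train_007'."""
    return "%s_%03d" % (prefix, index)


keys = [record_key("train", i) for i in (0, 7, 42, 99)]
print(keys)

# Equivalent spelling with str.format.
with_format = "train_{:03d}".format(7)

# Mixing the two styles does not raise an error; the template is returned as-is.
mixed = "train_%03d".format(7)
print(with_format, mixed)
```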

Now let's create a simple network that contains a single TensorProtosDBInput operator, to show how to read data back from the DB we just created.

net_proto = core.Net("example_reader")
dbreader = net_proto.CreateDB([], "dbreader", db="iris_train.minidb", db_type="minidb")
net_proto.TensorProtosDBInput([dbreader], ["X", "Y"], batch_size=16)

print("The net looks like this:")
print(str(net_proto.Proto()))

Output:

The net looks like this:
name: "example_reader"
op {
  output: "dbreader"
  name: ""
  type: "CreateDB"
  arg {
    name: "db_type"
    s: "minidb"
  }
  arg {
    name: "db"
    s: "iris_train.minidb"
  }
}
op {
  input: "dbreader"
  output: "X"
  output: "Y"
  name: ""
  type: "TensorProtosDBInput"
  arg {
    name: "batch_size"
    i: 16
  }
}
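
To get a feel for what batch_size=16 means with a 100-record training DB, here is a pure-Python stand-in for the reader. It is only a sketch: `batches` is a hypothetical helper, and it assumes that the reader returns to the first record after reaching the end of the DB, which this tutorial does not state.

```python
def batches(records, batch_size, num_batches):
    """Yield fixed-size batches, wrapping around at the end of the records."""
    cursor = 0
    for _ in range(num_batches):
        batch = []
        for _ in range(batch_size):
            batch.append(records[cursor])
            cursor = (cursor + 1) % len(records)  # assumed wrap-around
        yield batch


# Integers stand in for the 100 training records.
records = list(range(100))
out = list(batches(records, batch_size=16, num_batches=7))
print(out[0])
print(out[6])
```

Under that assumption, the seventh batch mixes the last four records with the first twelve, so every run of the net still produces a full batch of 16.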

Create the network

workspace.CreateNet(net_proto)
# Run the net once, then fetch the data it produced
workspace.RunNet(net_proto.Proto().name)
print("The first batch of feature is:")
print(workspace.FetchBlob("X"))
print("The first batch of label is:")
print(workspace.FetchBlob("Y"))

# Run it again
workspace.RunNet(net_proto.Proto().name)
print("The second batch of feature is:")
print(workspace.FetchBlob("X"))
print("The second batch of label is:")
print(workspace.FetchBlob("Y"))

Output:

The first batch of feature is:
[[ 5.19999981  4.0999999   1.5         0.1       ]
 [ 5.0999999   3.79999995  1.5         0.30000001]
 [ 6.9000001   3.0999999   4.9000001   1.5       ]
 [ 7.69999981  2.79999995  6.69999981  2.        ]
 [ 6.5999999   2.9000001   4.5999999   1.29999995]
 [ 6.30000019  2.79999995  5.0999999   1.5       ]
 [ 7.30000019  2.9000001   6.30000019  1.79999995]
 [ 5.5999999   2.9000001   3.5999999   1.29999995]
 [ 6.5         3.          5.19999981  2.        ]
 [ 5.          3.4000001   1.5         0.2       ]
 [ 6.9000001   3.0999999   5.4000001   2.0999999 ]
 [ 6.          3.4000001   4.5         1.60000002]
 [ 5.4000001   3.4000001   1.70000005  0.2       ]
 [ 6.30000019  2.70000005  4.9000001   1.79999995]
 [ 5.19999981  2.70000005  3.9000001   1.39999998]
 [ 6.19999981  2.9000001   4.30000019  1.29999995]]
The first batch of label is:
[0 0 1 2 1 2 2 1 2 0 2 1 0 2 1 1]
The second batch of feature is:
[[ 5.69999981  2.79999995  4.0999999   1.29999995]
 [ 5.0999999   2.5         3.          1.10000002]
 [ 4.4000001   2.9000001   1.39999998  0.2       ]
 [ 7.          3.20000005  4.69999981  1.39999998]
 [ 5.69999981  2.9000001   4.19999981  1.29999995]
 [ 5.          3.5999999   1.39999998  0.2       ]
 [ 5.19999981  3.5         1.5         0.2       ]
 [ 6.69999981  3.          5.19999981  2.29999995]
 [ 6.19999981  3.4000001   5.4000001   2.29999995]
 [ 6.4000001   2.70000005  5.30000019  1.89999998]
 [ 6.5         3.20000005  5.0999999   2.        ]
 [ 6.0999999   3.          4.9000001   1.79999995]
 [ 5.4000001   3.4000001   1.5         0.40000001]
 [ 4.9000001   3.0999999   1.5         0.1       ]
 [ 5.5         3.5         1.29999995  0.2       ]
 [ 6.69999981  3.          5.          1.70000005]]
The second batch of label is:
[1 1 0 1 1 0 0 2 2 2 2 2 0 0 0 1]
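
As a quick sanity check on the shuffling, you can count how many samples of each class ended up in a batch. The snippet below does this for the two label batches printed above; `class_counts` is a hypothetical helper, and the numbers apply only to this particular run, since the shuffle is random.

```python
from collections import Counter

# Label batches copied from the output above.
first_batch = [0, 0, 1, 2, 1, 2, 2, 1, 2, 0, 2, 1, 0, 2, 1, 1]
second_batch = [1, 1, 0, 1, 1, 0, 0, 2, 2, 2, 2, 2, 0, 0, 0, 1]


def class_counts(labels):
    """Return {class: count}, ordered by class id."""
    return dict(sorted(Counter(labels).items()))


print(class_counts(first_batch))
print(class_counts(second_batch))
```

Both batches contain all three classes in roughly equal numbers, which is what you would hope for from shuffled data.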

This concludes this section of the tutorial.
Please credit the source when reposting: http://www.jianshu.com/c/cf07b31bb5f2

Posted 2017-05-08 14:32 by 少侠阿朱