Classification with HDF5 data

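This section follows http://nbviewer.ipython.org/github/BVLC/caffe/blob/master/examples/hdf5_classification.ipynb. It compares scikit-learn's SGD logistic regression with a logistic regression learned by Caffe. When Caffe uses a linear model, the two reach roughly the same accuracy. When Caffe uses a nonlinear model, that is, with an activation layer added to the network (typically max(0, x)), the accuracy of the Caffe model is about 10 percentage points higher.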

1. Generate random data and visualize it

import os
import h5py
import shutil
import sklearn
import tempfile
import numpy as np
import pandas as pd
import sklearn.metrics
import sklearn.datasets
import sklearn.linear_model
import sklearn.cross_validation
import matplotlib.pyplot as plt


X, y = sklearn.datasets.make_classification(
    n_samples=10000, n_features=4, n_redundant=0, n_informative=2, 
    n_clusters_per_class=2, hypercube=False, random_state=0
)

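# Split into train and test.
# Note: sklearn.cross_validation was removed in scikit-learn 0.20;
# newer versions provide train_test_split in sklearn.model_selection.
X, Xt, y, yt = sklearn.cross_validation.train_test_split(X, y)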


# Visualize sample of the data
ind = np.random.permutation(X.shape[0])[:1000]
df = pd.DataFrame(X[ind])
# Note: newer pandas versions expose this function as pd.plotting.scatter_matrix.
_ = pd.scatter_matrix(df, figsize=(9, 9), diagonal='kde', marker='o', s=40, alpha=.4, c=y[ind])

2. Train and test the scikit-learn SGD model

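# Train and test the scikit-learn SGD logistic regression.
# Note: these arguments match the scikit-learn release in use at the time. Newer releases
# rename n_iter to max_iter, class_weight='auto' to 'balanced', and loss='log' to 'log_loss'.
clf = sklearn.linear_model.SGDClassifier(
    loss='log', n_iter=1000, penalty='l2', alpha=1e-3, class_weight='auto')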

clf.fit(X, y)
yt_pred = clf.predict(Xt)
print('Accuracy: {:.3f}'.format(sklearn.metrics.accuracy_score(yt, yt_pred)))

My accuracy came out at about 0.4, whereas the tutorial reports 0.763. I do not know why; I ran it several times.
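As a point of comparison, the kind of model that SGDClassifier fits with a logistic loss and an L2 penalty can be sketched in plain NumPy. This is only an illustration under stated assumptions: the helper names (fit_logistic_sgd, predict, demo), the constant learning rate, and the two-blob toy data are all hypothetical and are not what scikit-learn does internally (scikit-learn uses its own learning-rate schedule and supports class weighting).

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def fit_logistic_sgd(X, y, lr=0.1, alpha=1e-3, epochs=20, seed=0):
    """Per-sample SGD on the L2-penalized log loss (constant step size, an assumption)."""
    rng = np.random.RandomState(seed)
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(epochs):
        # Visit the samples in a fresh random order each epoch.
        for i in rng.permutation(n_samples):
            p = sigmoid(X[i].dot(w) + b)
            err = p - y[i]  # gradient of the log loss with respect to the logit
            w -= lr * (err * X[i] + alpha * w)
            b -= lr * err
    return w, b


def predict(X, w, b):
    # Class 1 when the logit is positive, i.e. predicted probability above 0.5.
    return (X.dot(w) + b > 0).astype(int)


def demo():
    # Toy data (hypothetical): two well-separated Gaussian blobs in 2D.
    rng = np.random.RandomState(0)
    n = 500
    X = np.vstack([rng.randn(n, 2) + 2.0, rng.randn(n, 2) - 2.0])
    y = np.concatenate([np.ones(n), np.zeros(n)])
    order = rng.permutation(2 * n)
    X, y = X[order], y[order]
    w, b = fit_logistic_sgd(X[:750], y[:750])
    return np.mean(predict(X[750:], w, b) == y[750:])


if __name__ == '__main__':
    print('Accuracy: {:.3f}'.format(demo()))
```

On data this cleanly separated, a linear decision boundary is enough, so the held-out accuracy should be well above chance.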

3. Train with Caffe: first convert the data generated above into a format Caffe accepts

# Write out the data to HDF5 files under ./hdf5_classification/data.
# This notebook is assumed to be caffe_root/examples/hdf5_classification.ipynb,
# and to be run from the directory that contains it.
dirname = os.path.abspath('./hdf5_classification/data')
if not os.path.exists(dirname):
    os.makedirs(dirname)

train_filename = os.path.join(dirname, 'train.h5')
test_filename = os.path.join(dirname, 'test.h5')

# HDF5DataLayer source should be a file containing a list of HDF5 filenames.
# To show this off, we'll list the same data file twice.
with h5py.File(train_filename, 'w') as f:
    f['data'] = X
    f['label'] = y.astype(np.float32)
with open(os.path.join(dirname, 'train.txt'), 'w') as f:
    f.write(train_filename + '\n')
    f.write(train_filename + '\n')
    
# HDF5 is pretty efficient, but can be further compressed.
comp_kwargs = {'compression': 'gzip', 'compression_opts': 1}
with h5py.File(test_filename, 'w') as f:
    f.create_dataset('data', data=Xt, **comp_kwargs)
    f.create_dataset('label', data=yt.astype(np.float32), **comp_kwargs)
with open(os.path.join(dirname, 'test.txt'), 'w') as f:
    f.write(test_filename + '\n')
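Before handing the files to Caffe, it can help to read them back and confirm that the dataset names, shapes, and dtypes are what you expect. The following is a minimal sketch of such a round trip on a small toy array; the function name write_and_read_back, the file name toy.h5, and the temporary directory are hypothetical and not part of the tutorial.

```python
import os
import tempfile

import h5py
import numpy as np


def write_and_read_back(dirname):
    """Write toy 'data' and 'label' datasets with gzip compression, then read them back."""
    rng = np.random.RandomState(0)
    X = rng.randn(8, 4)
    y = rng.randint(0, 2, size=8)

    path = os.path.join(dirname, 'toy.h5')  # hypothetical file name
    comp_kwargs = {'compression': 'gzip', 'compression_opts': 1}
    with h5py.File(path, 'w') as f:
        f.create_dataset('data', data=X, **comp_kwargs)
        f.create_dataset('label', data=y.astype(np.float32), **comp_kwargs)

    # Reopen read-only and copy the datasets into NumPy arrays.
    with h5py.File(path, 'r') as f:
        names = sorted(f.keys())
        data = f['data'][...]
        label = f['label'][...]
    return X, y, names, data, label


if __name__ == '__main__':
    tmpdir = tempfile.mkdtemp()
    X, y, names, data, label = write_and_read_back(tmpdir)
    print(names, data.shape, data.dtype, label.shape, label.dtype)
```

The same read-back pattern applies to train.h5 and test.h5 by opening them with h5py.File(train_filename, 'r').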

Then train the model, with caffe_root as the current directory:

 ./build/tools/caffe train -solver examples/hdf5_classification/solver.prototxt

The accuracy is about the same as that of the SGD model.

4. Add a nonlinear layer to the Caffe model

 ./build/tools/caffe train -solver examples/hdf5_classification/solver2.prototxt

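My accuracy rose to a little over 0.5, while the tutorial's rose to over 0.8, so the nonlinear layer clearly improves accuracy considerably. With only linear layers, a multi-layer network is in effect equivalent to a single layer.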
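The claim that stacked linear layers collapse into one can be checked numerically. The sketch below uses hypothetical random weights (W1, b1, W2, b2) and layer sizes chosen only for illustration; it is not the network defined in solver.prototxt or solver2.prototxt. It relies on the identity (x W1 + b1) W2 + b2 = x (W1 W2) + (b1 W2 + b2), which stops holding once max(0, x) is applied between the two layers.

```python
import numpy as np

rng = np.random.RandomState(0)
x = rng.randn(5, 4)                        # 5 samples, 4 features
W1, b1 = rng.randn(4, 8), rng.randn(8)     # first layer: 4 -> 8 (hypothetical sizes)
W2, b2 = rng.randn(8, 2), rng.randn(2)     # second layer: 8 -> 2


def two_linear_layers(x):
    # Two inner-product layers with no activation in between.
    return (x.dot(W1) + b1).dot(W2) + b2


def collapsed_layer(x):
    # A single inner-product layer with the combined weights and bias.
    W = W1.dot(W2)
    b = b1.dot(W2) + b2
    return x.dot(W) + b


def with_relu(x):
    # The same two layers with max(0, x) applied to the hidden activations.
    hidden = np.maximum(0, x.dot(W1) + b1)
    return hidden.dot(W2) + b2


linear_matches = np.allclose(two_linear_layers(x), collapsed_layer(x))
relu_matches = np.allclose(with_relu(x), collapsed_layer(x))
print(linear_matches, relu_matches)
```

The first comparison holds for any weights, while the second fails as soon as any hidden activation is negative, which is why the activation layer lets the network represent something a single linear layer cannot.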

posted @ 2015-01-23 19:00  dupuleng