Classification with HDF5 data
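This section follows the notebook at http://nbviewer.ipython.org/github/BVLC/caffe/blob/master/examples/hdf5_classification.ipynb. It compares scikit-learn's SGD logistic regression with a logistic regression learned through caffe. When caffe uses a linear model, the two reach about the same accuracy. When caffe uses a nonlinear model, i.e. an activation layer (typically max(0, x)) is added to the network, the accuracy of the caffe model is roughly 10 percentage points higher.
1. Generate random data and visualize it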
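To make the linear versus nonlinear distinction concrete, here is a minimal NumPy sketch of the two kinds of forward pass being compared. This is not caffe code: the function names, the hidden width of 3, and the random weights are assumptions chosen for illustration. Only the max(0, x) activation and the 4 input features come from this section.

```python
import numpy as np

def relu(x):
    """Elementwise max(0, x), the activation mentioned above."""
    return np.maximum(0, x)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def linear_model(X, w, b):
    """Logistic regression: a single affine map followed by a sigmoid."""
    return sigmoid(X @ w + b)

def nonlinear_model(X, W1, b1, w2, b2):
    """One hidden layer with ReLU between two affine maps."""
    hidden = relu(X @ W1 + b1)
    return sigmoid(hidden @ w2 + b2)

rng = np.random.RandomState(0)
X_demo = rng.randn(5, 4)  # 5 samples, 4 features (hypothetical data)
p_lin = linear_model(X_demo, rng.randn(4), 0.0)
p_non = nonlinear_model(X_demo, rng.randn(4, 3), np.zeros(3), rng.randn(3), 0.0)
```

Both functions return one probability per sample. The only structural difference is the relu call between the two affine maps.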
import os
import h5py
import shutil
import tempfile

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import sklearn
import sklearn.datasets
import sklearn.linear_model
# Needed for train_test_split below; newer scikit-learn versions
# provide it in sklearn.model_selection instead.
import sklearn.cross_validation

X, y = sklearn.datasets.make_classification(
    n_samples=10000, n_features=4, n_redundant=0, n_informative=2,
    n_clusters_per_class=2, hypercube=False, random_state=0
)

# Split into train and test
X, Xt, y, yt = sklearn.cross_validation.train_test_split(X, y)

# Visualize sample of the data
ind = np.random.permutation(X.shape[0])[:1000]
df = pd.DataFrame(X[ind])
# In newer pandas versions, scatter_matrix lives in pd.plotting.
_ = pd.scatter_matrix(df, figsize=(9, 9), diagonal='kde', marker='o', s=40, alpha=.4, c=y[ind])
2. Train and test the scikit-learn SGD model
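import sklearn.metrics

# Train and test the scikit-learn SGD logistic regression.
# Argument names follow the scikit-learn version used by the tutorial;
# later releases renamed n_iter to max_iter and class_weight='auto' to 'balanced'.
clf = sklearn.linear_model.SGDClassifier(
    loss='log', n_iter=1000, penalty='l2', alpha=1e-3, class_weight='auto')

clf.fit(X, y)
yt_pred = clf.predict(Xt)
print('Accuracy: {:.3f}'.format(sklearn.metrics.accuracy_score(yt, yt_pred)))
My accuracy is around 0.4, whereas the tutorial reports 0.763. I don't know why; I ran it several times.
3. Train with caffe. First convert the data generated above into a format that caffe accepts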
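When a score looks surprising, one possible sanity check is to compare it with the accuracy of always predicting the most frequent class. The sketch below is only an illustration of that check: the helper names accuracy and majority_baseline are mine, and the labels are made up rather than taken from the run above.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

def majority_baseline(y_true):
    """Accuracy obtained by always predicting the most frequent label."""
    _, counts = np.unique(np.asarray(y_true), return_counts=True)
    return float(counts.max() / counts.sum())

# Hypothetical labels, for illustration only (not results from this section).
y_true = np.array([0, 1, 1, 0, 1, 0, 1, 1])
y_pred = np.array([0, 1, 0, 0, 1, 1, 1, 1])
print('accuracy: {:.3f}'.format(accuracy(y_true, y_pred)))
print('majority baseline: {:.3f}'.format(majority_baseline(y_true)))
```

With the real data, calling majority_baseline(yt) and comparing it with the classifier's score would show whether the model does better than a constant guess. For a two-class problem, a score below that baseline usually points to something in the setup rather than to a merely weak model.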
# Write out the data to HDF5 files in a temp directory.
# This file is assumed to be caffe_root/examples/hdf5_classification.ipynb
dirname = os.path.abspath('./hdf5_classification/data')
if not os.path.exists(dirname):
    os.makedirs(dirname)

train_filename = os.path.join(dirname, 'train.h5')
test_filename = os.path.join(dirname, 'test.h5')

# HDF5DataLayer source should be a file containing a list of HDF5 filenames.
# To show this off, we'll list the same data file twice.
with h5py.File(train_filename, 'w') as f:
    f['data'] = X
    f['label'] = y.astype(np.float32)
with open(os.path.join(dirname, 'train.txt'), 'w') as f:
    f.write(train_filename + '\n')
    f.write(train_filename + '\n')

# HDF5 is pretty efficient, but can be further compressed.
comp_kwargs = {'compression': 'gzip', 'compression_opts': 1}
with h5py.File(test_filename, 'w') as f:
    f.create_dataset('data', data=Xt, **comp_kwargs)
    f.create_dataset('label', data=yt.astype(np.float32), **comp_kwargs)
with open(os.path.join(dirname, 'test.txt'), 'w') as f:
    f.write(test_filename + '\n')
Then train the model. The current directory is caffe_root.
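Before handing the files to caffe, it can help to read one back and confirm that it contains the 'data' and 'label' datasets with the expected shape and dtype. The following is a minimal sketch of such a round trip. It uses tiny stand-in arrays and a throwaway temporary file (both assumptions for illustration) rather than the real train.h5.

```python
import os
import tempfile

import h5py
import numpy as np

# Stand-in arrays (assumption): same layout as X and y above, but tiny.
X_demo = np.random.RandomState(0).randn(6, 4)
y_demo = np.array([0, 1, 0, 1, 1, 0])

tmp_dir = tempfile.mkdtemp()
h5_path = os.path.join(tmp_dir, 'demo.h5')

# Write the file the same way the training file is written above.
with h5py.File(h5_path, 'w') as f:
    f['data'] = X_demo
    f['label'] = y_demo.astype(np.float32)

# Read it back and inspect what was stored.
with h5py.File(h5_path, 'r') as f:
    names = sorted(f.keys())
    data_shape = f['data'].shape
    label_dtype = f['label'].dtype
    data_back = f['data'][...]

print(names, data_shape, label_dtype)
```

The same read-only block, pointed at train_filename or test_filename, would report what caffe is actually going to load.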
./build/tools/caffe train -solver examples/hdf5_classification/solver.prototxt
The accuracy is about the same as that of the SGD model.
4. Add a nonlinear layer to the caffe model
./build/tools/caffe train -solver examples/hdf5_classification/solver2.prototxt
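My accuracy rose to 0.5+, while the web tutorial reaches 0.8+, so the nonlinear transformation clearly improves accuracy a lot. With linear layers only, a multi-layer network is effectively equivalent to a single layer.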
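The last point can be checked numerically. Two stacked linear layers compute W2(W1 x + b1) + b2, which equals (W2 W1) x + (W2 b1 + b2), i.e. a single linear layer. The NumPy sketch below verifies this for random weights. The layer sizes and variable names are assumptions for illustration and are not taken from the caffe prototxt files.

```python
import numpy as np

rng = np.random.RandomState(0)
x = rng.randn(4)  # one input with 4 features

# Two stacked linear (affine) layers with no activation in between.
W1, b1 = rng.randn(3, 4), rng.randn(3)
W2, b2 = rng.randn(2, 3), rng.randn(2)
two_layers = W2 @ (W1 @ x + b1) + b2

# The same map written as a single layer.
W, b = W2 @ W1, W2 @ b1 + b2
one_layer = W @ x + b

# With max(0, x) in between, the layers can no longer be merged in general.
with_relu = W2 @ np.maximum(0, W1 @ x + b1) + b2

print(np.allclose(two_layers, one_layer))
```

Because the merged W and b do not depend on x, the equality holds for every input. That is why extra depth adds nothing until an activation such as max(0, x) sits between the layers.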