Deep-learning alchemy basics: single crop vs. multiple crops
What are single crop and multiple crops?
For a classification network such as AlexNet, single crop and multiple crops give different results at test time [0]; multiple crops amount to applying data augmentation to the test image.
As shicaiyang (星空下的巫师) explains [1], training of course uses random crops, but at test time there are several options:
- Simply resize the test image to some scale (e.g. 256xN) and feed the center crop (the region in the middle of the image, e.g. 224x224) to the CNN to evaluate the model.
- Multiple crops can take several concrete forms of your choosing, for example (see the sketch after this list):
    - 10 crops: take the (top-left, bottom-left, top-right, bottom-right, center) crops plus their horizontal flips. The average of the CNN predictions over these 10 crops is the final prediction.
    - 144 crops: this one is more involved; taking ImageNet as an example:
        - First resize the image to 4 scales (e.g. 256xN, 320xN, 384xN, 480xN).
        - At each scale, take the square regions at 3 positions (leftmost, center, rightmost).
        - From each square region, take the 10 224x224 crops described above, giving 4x3x10 = 120 crops.
        - Also resize each square region directly to 224x224 and take its horizontal flip, giving another 4x3x2 = 24 crops.
        - Altogether that is 120 + 24 = 144 crops; the average of the predictions over all crops is the model's output for the test image.
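As a concrete illustration, here is a minimal numpy/Pillow sketch of both schemes. All names (ten_crops, resize_shorter, crops_144) are hypothetical helpers written for this post, not library functions; the 144-crop variant follows the recipe listed above.

import numpy as np
from PIL import Image

def ten_crops(img, size=224):
    """4 corners + center of an HxWxC array, plus horizontal flips -> 10 crops."""
    h, w = img.shape[:2]
    offs = [(0, 0), (0, w - size), (h - size, 0),
            (h - size, w - size), ((h - size) // 2, (w - size) // 2)]
    crops = [img[y:y + size, x:x + size] for y, x in offs]
    return np.stack(crops + [c[:, ::-1] for c in crops])

def resize_shorter(im, target):
    """Resize a PIL image so its shorter side equals target."""
    w, h = im.size
    if w < h:
        return im.resize((target, round(h * target / w)), Image.BILINEAR)
    return im.resize((round(w * target / h), target), Image.BILINEAR)

def crops_144(im, scales=(256, 320, 384, 480), size=224):
    """4 scales x 3 squares x (10 crops + resized square + its flip) = 144."""
    out = []
    for s in scales:
        arr = np.asarray(resize_shorter(im, s))
        h, w = arr.shape[:2]
        # 3 square regions along the longer side (left/center/right,
        # or top/middle/bottom for portrait images)
        if w >= h:
            squares = [arr[:, x:x + h] for x in (0, (w - h) // 2, w - h)]
        else:
            squares = [arr[y:y + w, :] for y in (0, (h - w) // 2, h - w)]
        for sq in squares:
            out.append(ten_crops(sq, size))                # the 10 sub-crops
            small = np.asarray(Image.fromarray(sq).resize((size, size), Image.BILINEAR))
            out.append(np.stack([small, small[:, ::-1]]))  # resized square + flip
    return np.concatenate(out)  # shape: (144, size, size, C)

Averaging the network's softmax outputs over the returned stack then gives the multi-crop prediction.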
The 10-crop procedure above is mentioned in the ZFNet paper [2]:
The model was trained on the ImageNet 2012 training set (1.3 million images, spread over 1000 different classes). Each RGB image was preprocessed by resizing the smallest dimension to 256, cropping the center 256x256 region, subtracting the per-pixel mean (across all images) and then using 10 different sub-crops of size 224x224 (corners + center with(out) horizontal flips).
The earlier original AlexNet paper [4] describes it in even more detail; multiple crops were used to prevent overfitting:
The easiest and most common method to reduce overfitting on image data is to artificially enlarge the dataset using label-preserving transformations (e.g., [25, 4, 5]). We employ two distinct forms of data augmentation, both of which allow transformed images to be produced from the original images with very little computation, so the transformed images do not need to be stored on disk. In our implementation, the transformed images are generated in Python code on the CPU while the GPU is training on the previous batch of images. So these data augmentation schemes are, in effect, computationally free.
The first form of data augmentation consists of generating image translations and horizontal reflections. We do this by extracting random 224x224 patches (and their horizontal reflections) from the 256x256 images and training our network on these extracted patches. This increases the size of our training set by a factor of 2048, though the resulting training examples are, of course, highly interdependent. Without this scheme, our network suffers from substantial overfitting, which would have forced us to use much smaller networks. At test time, the network makes a prediction by extracting five 224x224 patches (the four corner patches and the center patch) as well as their horizontal reflections (hence ten patches in all), and averaging the predictions made by the network's softmax layer on the ten patches.
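A quick sanity check of the "factor of 2048": the paper counts 32 offsets per axis (256 - 224) times 2 for mirroring; counting both extremes would actually give 33x33x2 = 2178, so 2048 is the paper's round figure.

# 32 horizontal offsets x 32 vertical offsets x 2 mirror states
print((256 - 224) * (256 - 224) * 2)  # 2048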
However, in the AlexNet example that ships with Caffe (models/bvlc_alexnet/train_val.prototxt), both the TRAIN and the TEST phase use a single crop. This is configured via crop_size in the data layer of the prototxt:
name: "AlexNet"
layer {
name: "data"
type: "Data"
top: "data"
top: "label"
include {
phase: TRAIN
}
transform_param {
mirror: true
crop_size: 227
mean_file: "data/ilsvrc12/imagenet_mean.binaryproto"
}
data_param {
source: "examples/imagenet/ilsvrc12_train_lmdb"
batch_size: 256
backend: LMDB
}
}
layer {
name: "data"
type: "Data"
top: "data"
top: "label"
include {
phase: TEST
}
transform_param {
mirror: false
crop_size: 227
mean_file: "data/ilsvrc12/imagenet_mean.binaryproto"
}
data_param {
source: "examples/imagenet/ilsvrc12_val_lmdb"
batch_size: 50
backend: LMDB
}
}
Looking at how Caffe [3] actually implements this crop: during training the crop offset is a random integer in [0, im_height - crop_size] (drawn via Rand(im_height - crop_size + 1)), while at test time the offset is fixed at (im_height - crop_size) / 2, i.e. a center crop. The implementation lives in src/caffe/data_transformer.cpp:
int h_off = 0;
int w_off = 0;
if (crop_size) {
  height = crop_size;
  width = crop_size;
  // We only do random crop when we do training.
  if (phase_ == TRAIN) {
    h_off = Rand(datum_height - crop_size + 1);
    w_off = Rand(datum_width - crop_size + 1);
  } else {
    h_off = (datum_height - crop_size) / 2;
    w_off = (datum_width - crop_size) / 2;
  }
}
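To make the arithmetic concrete: with a 256x256 stored image and crop_size 227, the TRAIN offset is drawn uniformly from [0, 29], while the TEST offset is always 14. A tiny standalone Python check mirroring the C++ logic above:

import random

datum_height, crop_size = 256, 227
# TRAIN: random offset in [0, datum_height - crop_size] = [0, 29]
h_off_train = random.randrange(datum_height - crop_size + 1)
# TEST: fixed center offset
h_off_test = (datum_height - crop_size) // 2  # always 14
print(h_off_train, h_off_test)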
Conclusion
So in the Caffe framework, within one iteration the data layer performs a single crop. Across many iterations, training can be viewed as multiple crops, since the crop offset is random and differs between iterations; at test time, however, Caffe's data layer really does perform a single (center) crop. If you want multiple crops at test time you have to arrange it yourself, e.g. by cropping the original image into several files on disk beforehand, or by testing through pycaffe and generating the multiple crops on the fly as you run the test.
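For the pycaffe route, here is one possible sketch. caffe.io.oversample is the pycaffe helper (in python/caffe/io.py) that produces the 10 corner/center/flip crops; the file names 'deploy.prototxt', 'bvlc_alexnet.caffemodel', 'test.jpg', the output blob name 'prob', and the BGR mean values are all assumptions to be replaced with your own:

import numpy as np
import caffe

# hypothetical file names; substitute your own deploy net and weights
net = caffe.Net('deploy.prototxt', 'bvlc_alexnet.caffemodel', caffe.TEST)

im = caffe.io.load_image('test.jpg')            # HxWx3, RGB, float in [0, 1]
im = caffe.io.resize_image(im, (256, 256))
crops = caffe.io.oversample([im], (227, 227))   # (10, 227, 227, 3)

# match the training-time transform: RGB->BGR, HWC->CHW, rescale, subtract mean
batch = crops[:, :, :, ::-1].transpose(0, 3, 1, 2) * 255.0
batch -= np.array([104.0, 117.0, 123.0]).reshape(1, 3, 1, 1)  # assumed BGR mean

net.blobs['data'].reshape(*batch.shape)
net.blobs['data'].data[...] = batch
prob = net.forward()['prob'].mean(axis=0)       # average over the 10 crops
print(prob.argmax())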
refs
- [0] https://www.zhihu.com/question/268494717/answer/356102226 (何钦尧's comment under the answer by 王晋东不在家)
- [1] http://caffecn.cn/?/question/428
- [2] Visualizing and Understanding Convolutional Networks, arXiv:1311.2901
- [3] https://github.com/BVLC/caffe
- [4] ImageNet Classification with Deep Convolutional Neural Networks, NIPS 2012