@InProceedings{Zhang_2018_CVPR,
author = {Shifeng Zhang and Longyin Wen and Xiao Bian and Zhen Lei and Stan Z. Li},
title = { Single-Shot Refinement Neural Network for Object Detection },
booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2018}
}
Paper: https://arxiv.org/abs/1711.06897
Introduction
RefineDet achieved state-of-the-art results on generic object detection at the time of publication, and was the first real-time method to exceed 80% mAP on PASCAL VOC 2007.
It combines several classical architectures, inheriting their merits while overcoming their disadvantages. For example, the anchor design is inherited from YOLO, the feature fusion improves on the scheme used in FPN, and the ARM backbone and the rule for matching prior boxes to ground truths are similar to those of SSD.
Details
DNN-based object detectors can be divided into two categories: one-stage and two-stage approaches. The former performs bounding-box regression and object classification once, while the latter performs each twice. In general, one-stage methods are more computationally efficient, while two-stage methods achieve higher accuracy.
In this paper, the authors propose a single-shot generic object detector called RefineDet, which combines the advantages of both approaches. Although it performs regression and classification twice, RefineDet does not process every ROI separately, so it can informally be described as a 1.5-stage approach.
Architecture
RefineDet uses two inter-connected modules: the anchor refinement module (ARM) and the object detection module (ODM).
Like one-stage approaches, RefineDet starts from a fixed set of evenly distributed anchors. Because most anchors contain no object, such methods suffer from class imbalance, which lowers accuracy. The ARM alleviates this by (1) filtering out negative anchors to reduce the search space for the classifier, and (2) coarsely adjusting the locations and sizes of anchors to provide a better initialization for the subsequent regressor.
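The filtering step (1) can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `filter_negative_anchors`, the plain list input, and the threshold of 0.99 (the value reported in the RefineDet paper, not stated in these notes) are assumptions.

```python
def filter_negative_anchors(background_scores, theta=0.99):
    """Return indices of anchors the ARM does not confidently reject.

    background_scores[i] is the ARM's predicted probability that anchor i
    contains no object. Anchors scoring above theta are treated as easy
    negatives and dropped before the ODM sees them.
    theta=0.99 is an assumed default, not taken from these notes.
    """
    return [i for i, score in enumerate(background_scores) if score <= theta]


# Hypothetical ARM outputs for five anchors.
scores = [0.999, 0.42, 0.995, 0.80, 0.05]
print(filter_negative_anchors(scores))  # [1, 3, 4]
```

Only the confidently rejected anchors are removed, so hard negatives (here index 3) still reach the ODM.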
The ODM takes the refined anchors from the ARM (the positives and the hard negatives that survive filtering) as input to further improve regression accuracy and predict multi-class labels. Two-step cascaded regression is more accurate than one-step regression in some challenging scenarios, especially for small objects.
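A minimal sketch of two-step cascaded regression is shown below. It assumes the common center-size box parameterization used by SSD-style detectors (without the variance scaling that real implementations often apply). The helper names `decode` and `cascade_decode` and the offset values are hypothetical.

```python
import math


def decode(anchor, offsets):
    """Apply (dx, dy, dw, dh) offsets to a (cx, cy, w, h) box."""
    cx, cy, w, h = anchor
    dx, dy, dw, dh = offsets
    return (cx + dx * w, cy + dy * h, w * math.exp(dw), h * math.exp(dh))


def cascade_decode(anchor, arm_offsets, odm_offsets):
    """Two-step regression: the ARM refines the anchor, then the ODM
    refines the ARM's output rather than the original anchor."""
    refined = decode(anchor, arm_offsets)
    return decode(refined, odm_offsets)


anchor = (100.0, 100.0, 32.0, 32.0)
arm = (0.5, 0.0, 0.0, 0.0)   # hypothetical coarse shift of half a width
odm = (0.25, 0.0, 0.0, 0.0)  # hypothetical fine shift
print(cascade_decode(anchor, arm, odm))  # (124.0, 100.0, 32.0, 32.0)
```

The second regressor starts from a box that is already close to the object, which is one plausible reading of why the cascade helps on small objects.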
The transfer connection block (TCB) transfers features from the ARM to the ODM, where they are used to predict the locations, sizes and class labels of objects. The TCB adds higher-level features to the transferred features element-wise, using a deconvolution to match their dimensions, unlike FPN, which uses plain upsampling. This feature fusion improves detection accuracy.
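To see why a deconvolution makes the element-wise addition possible, the shape arithmetic can be checked with a short sketch. The feature-map sizes (for a 320x320 input) and the kernel/stride settings are assumptions chosen for illustration; `deconv_output_size` is a hypothetical helper implementing the standard transposed-convolution size formula.

```python
def deconv_output_size(size, kernel, stride, padding=0):
    """Spatial output size of a transposed convolution (no output padding,
    no dilation): (size - 1) * stride - 2 * padding + kernel."""
    return (size - 1) * stride - 2 * padding + kernel


# Assumed feature-map sizes for a 320x320 input, coarsest first.
sizes = [5, 10, 20, 40]
for high, low in zip(sizes, sizes[1:]):
    upsampled = deconv_output_size(high, kernel=2, stride=2)
    # The upsampled high-level map must match the next finer map
    # before the two can be added element-wise.
    print(high, "->", upsampled, "matches", low, ":", upsampled == low)
```

A 4x4 kernel with stride 2 and padding 1 doubles the resolution in the same way, so either setting would satisfy the shape constraint.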
Four layers of different resolutions are used for prediction. High-resolution maps let the detector “see” small objects clearly, while low-resolution maps, which carry high-level semantic information, are better for detecting large objects.
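The following sketch counts how many anchors each of the four prediction layers contributes. The input size (320), the strides (8, 16, 32, 64) and the three anchors per cell follow the 320x320 configuration described in the RefineDet paper. They are not stated in these notes, so treat them as assumptions. `anchors_per_layer` is a hypothetical helper.

```python
def anchors_per_layer(input_size, strides, anchors_per_cell):
    """Number of anchors contributed by each prediction layer."""
    counts = []
    for stride in strides:
        cells = (input_size // stride) ** 2  # square feature map
        counts.append(cells * anchors_per_cell)
    return counts


counts = anchors_per_layer(320, strides=[8, 16, 32, 64], anchors_per_cell=3)
print(counts, sum(counts))  # [4800, 1200, 300, 75] 6375
```

Most anchors come from the highest-resolution layer, which is where small objects are detected.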
Loss Function
The binary classification loss Lb is the cross-entropy (log) loss over two classes (object vs. not object). The multi-class classification loss Lm is the softmax loss over the multi-class confidences. The smooth L1 loss is used as the regression loss Lr.
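The three terms can be written out per anchor as in the sketch below. It is a plain-Python illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the smooth L1 uses the usual threshold of 1, and batching and normalization are left out.

```python
import math


def smooth_l1(pred, target):
    """Lr: smooth L1 summed over box coordinates.
    0.5 * d^2 if |d| < 1, otherwise |d| - 0.5."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d if d < 1.0 else d - 0.5
    return total


def binary_cross_entropy(p_object, is_object):
    """Lb: log loss for object vs. not object."""
    return -math.log(p_object if is_object else 1.0 - p_object)


def softmax_loss(logits, label):
    """Lm: negative log of the softmax probability of the true class."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]


print(smooth_l1([0.5, 2.0], [0.0, 0.0]))  # 1.625
```

In the paper, Lb and Lr are summed for the ARM and Lm and Lr for the ODM, with each module's sum normalized by its number of positive anchors.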