RCNN, Fast-RCNN, Faster-RCNN: An Illustrated Explanation (in English)
RCNN, Fast-RCNN, Faster-RCNN
TL;DR
Progression:
- RCNN: Two Stage
  - Selective Search for proposals
  - image warping
  - pretrained CNN -> SVM classification
- Fast-RCNN: One Stage (i.e. a single CNN pass)
  - RoI Pooling
  - single CNN pass -> single shared feature map
  - why fast: shared features, reduced computation
  - cls + reg heads
- Faster-RCNN: Two Stage & Anchor-based
  - introduces RPN
  - introduces Anchors
RCNN (2014)
- Selective Search generates region proposals
- every proposed region is resized (warped) and fed through the CNN
- one SVM per class scores the CNN features
- per-class NMS removes duplicates (see the sketch below)
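Below is a minimal sketch of this inference loop in PyTorch. The `backbone`, `svm_weights`, and `svm_bias` are hypothetical stand-ins for the fine-tuned CNN and the trained per-class linear SVMs; the 224x224 warp size is the usual AlexNet/VGG input, assumed here for illustration.

```python
import torch
import torch.nn.functional as F
import torchvision

def rcnn_inference(image, proposals, backbone, svm_weights, svm_bias, nms_iou=0.3):
    """R-CNN inference: warp each proposal, extract CNN features,
    score with per-class linear SVMs, then run per-class NMS.
    image: (3, H0, W0); proposals: (N, 4) boxes as (x1, y1, x2, y2)."""
    feats = []
    for x1, y1, x2, y2 in proposals.round().long():
        crop = image[:, y1:y2, x1:x2]                        # crop the proposed region
        warped = F.interpolate(crop[None], size=(224, 224),  # warp to the fixed CNN input size
                               mode="bilinear", align_corners=False)
        feats.append(backbone(warped).flatten(1))            # e.g. a 4096-d fc feature
    feats = torch.cat(feats)                                 # (N, D)
    scores = feats @ svm_weights.T + svm_bias                # (N, num_classes)

    detections = []
    for c in range(scores.shape[1]):                         # NMS runs independently per class
        keep = torchvision.ops.nms(proposals, scores[:, c], nms_iou)
        detections.append((c, proposals[keep], scores[keep, c]))
    return detections
```

Note that the CNN forward pass runs once per proposal, which is exactly the redundancy Fast R-CNN later removes.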
Experiment Details
Is there pretraining?
Yes.
Train the CNN on a classification dataset first (a discriminative model).
How to construct the training set?
- \(\geq 0.5\) IoU as positives, others as negatives
- dilate the tight bounding box so that at the warped size there are exactly \(p\) pixels of warped image context around the original box
- warp all pixels (object + a little context) in the dilated box to the required size
How to train on the detection dataset?
- change the 1000-way classification layer (for ILSVRC) to a 21-way classification layer for detection (20 object classes + background)
We treat all region proposals with \(\geq 0.5\) IoU overlap with a ground-truth box as positives for that box’s class and the rest as negatives.
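A sketch of how this \(\geq 0.5\) IoU assignment could look (the threshold is from the paper; the helper names are illustrative, not R-CNN's actual code):

```python
import numpy as np

def iou(box, boxes):
    """IoU between one (x1, y1, x2, y2) box and an (N, 4) array of boxes."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter)

def assign_labels(proposals, gt_boxes, gt_classes, pos_iou=0.5):
    """Label each proposal with the class of its best-overlapping ground-truth
    box if IoU >= 0.5, otherwise background (class 0)."""
    labels = np.zeros(len(proposals), dtype=np.int64)  # 0 = background
    for i, p in enumerate(proposals):
        overlaps = iou(p, gt_boxes)
        best = overlaps.argmax()
        if overlaps[best] >= pos_iou:
            labels[i] = gt_classes[best]
    return labels
```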
How to determine the classification threshold?
By grid search;
the paper settles on 0.3 IoU overlap as the threshold (proposals below it count as negatives when training the SVMs).
Why not directly use the classification layer of the CNN to perform region discrimination?
We also discuss why it’s necessary to train detection classifiers rather than simply use outputs from the final layer (fc8) of the fine-tuned CNN.
TODO 『2022-02-28 [supplementary material]?』
Fast RCNN
From SPPNet to RoI Pooling
Why do CNNs require a fixed input size?
In fact, convolutional layers do not require a fixed image size and can generate feature maps of any size.
On the other hand, the fully-connected layers need to have fixed-size/length input by their definition. Hence, the fixed-size constraint comes only from the fully-connected layers, which exist at a deeper stage of the network.
SPP = Concat(maxPool(scale, feature map) over a set of scales)
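A minimal sketch of SPP with PyTorch's `adaptive_max_pool2d`; the 4x4/2x2/1x1 pyramid levels are one common choice, assumed here:

```python
import torch
import torch.nn.functional as F

def spp(feature_map, levels=(4, 2, 1)):
    """Spatial pyramid pooling: max-pool the feature map into fixed grids at
    several scales, then concatenate into one fixed-length vector.
    feature_map: (N, C, H, W) with arbitrary H, W."""
    pooled = [F.adaptive_max_pool2d(feature_map, output_size=l).flatten(1)
              for l in levels]                    # each entry: (N, C * l * l)
    return torch.cat(pooled, dim=1)               # (N, C * sum(l*l))

x = torch.randn(1, 256, 13, 17)                   # any spatial size works
print(spp(x).shape)                               # torch.Size([1, 5376]) = 256 * (16 + 4 + 1)
```

Because the output length depends only on the pyramid levels and channel count, the fully-connected layers after it can accept images of any size.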
Pipeline
still uses Selective Search for proposals
- RoI Pooling (⭐ making it fast)
- reuse conv features
- end-to-end training
- ⭐ cls + reg detection heads (see the sketch below)
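A sketch of the Fast R-CNN head on top of RoI pooling (channel sizes assume a VGG16 backbone with 1/16 downscaling and 21 classes; the class name is made up):

```python
import torch
import torch.nn as nn
import torchvision

class FastRCNNHead(nn.Module):
    """cls + reg detection heads on top of RoI-pooled features."""
    def __init__(self, in_channels=512, roi_size=7, num_classes=21):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(in_channels * roi_size * roi_size, 4096), nn.ReLU(),
            nn.Linear(4096, 4096), nn.ReLU(),
        )
        self.cls = nn.Linear(4096, num_classes)      # class scores, incl. background
        self.reg = nn.Linear(4096, num_classes * 4)  # per-class box deltas

    def forward(self, feature_map, rois):
        # rois: list with one (K_i, 4) box tensor per image, in image coordinates;
        # spatial_scale projects them onto the 1/16-downscaled VGG16 feature map
        pooled = torchvision.ops.roi_pool(
            feature_map, rois, output_size=7, spatial_scale=1 / 16)
        h = self.fc(pooled.flatten(1))               # (sum K_i, 4096)
        return self.cls(h), self.reg(h)
```

The backbone runs once per image; only this small head runs once per RoI, which is why sharing the feature map makes it fast.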
Faster RCNN
Contribution
- RPN: predefined bboxes (anchors) with different scales
- the rest is inherited from Fast RCNN
RPN
learns the proposals from feature maps
🔍 see also RPN
Pipeline
- given a raw image \(I \in \mathbb{R}^{H_{0}\times W_{0}}\), feed it to a backbone
the image is downscaled into \(C_{\text{backbone}}\) tiled feature maps \(\mathbf{f} \in \mathbb{R}^{C_{\text{backbone}} \times H\times W}\)
e.g. for VGG16, the image is downscaled by \(\mathcal{R} = 16\) times by the end of the backbone
- for each point \(p = (x,y)\) in \(\mathbf{f}\)
first, generate \(k\) anchors
🔍 see below, what anchors are
then apply a small convnet of size \(n\times n\) centered at \((x,y)\) on the feature map
❗ About how to adjust the predefined anchor boxes
- this convnet of size \(n\times n\) is compared to a sliding window in the paper
- this convnet is composed of \(D\) filters, and each filter is of size \(n\times n\)
at each point \(p\), the \(D=256\) filters produce \(256\) numbers as a local linear combination (i.e. what a CNN does) of an \(n\times n\) neighborhood.
So, given tiled feature maps \(\mathbf{f} \in \mathbb{R}^{C_{\text{backbone}}\times H\times W}\), we get \(\mathbf{f}_{\text{rpn}} \in \mathbb{R}^{256\times H \times W}\)
- apply 1x1 convs to score and refine the anchors
from \(\mathbf{f}_{\text{rpn}} \in \mathbb{R}^{256\times H \times W}\) to
- \(\mathcal{S}_{\text{cls}}\in \mathbb{R}^{2(\text{cls}) \times 9(\text{\#anchor}) \times H \times W}\)
\(\text{cls}\) is object vs. background (binary classification)
- \(\mathcal{S}_{\text{reg}}\in \mathbb{R}^{4(x,y,w,h) \times 9(\text{\#anchor}) \times H \times W}\)
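A minimal sketch of this RPN head in PyTorch (\(n=3\), \(D=256\) as in the note above, \(k=9\); the module name is hypothetical):

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    """3x3 'sliding window' conv followed by two 1x1 sibling convs."""
    def __init__(self, in_channels=512, mid_channels=256, k=9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, mid_channels, kernel_size=3, padding=1)
        self.cls = nn.Conv2d(mid_channels, 2 * k, kernel_size=1)  # object vs background, per anchor
        self.reg = nn.Conv2d(mid_channels, 4 * k, kernel_size=1)  # (dx, dy, dw, dh), per anchor

    def forward(self, f):
        # f: (N, C_backbone, H, W), the shared feature map from the backbone
        f_rpn = torch.relu(self.conv(f))          # (N, 256, H, W)
        return self.cls(f_rpn), self.reg(f_rpn)   # (N, 2k, H, W), (N, 4k, H, W)
```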
Anchor
At each anchor point, there will be \(k\) predefined boxes in the original image.
Actually it doesn't matter where the anchors are defined:
if anchors are defined and refined with coordinates on the feature map, then one more step of back-projecting them to the original image is needed.
There are \(H\times W\) anchor points in the feature map, hence \(k \times H \times W\) anchors in total.
So there is a set of predefined anchors in the original image \(I\) at every stride \(\mathcal{R}\) (generation sketched below).
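A sketch of anchor generation; the 3 scales x 3 ratios (\(k=9\)) and stride \(\mathcal{R}=16\) follow the paper, and the 128/256/512 box sizes are the commonly used set:

```python
import numpy as np

def generate_anchors(H, W, stride=16, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return all k*H*W anchors as (x1, y1, x2, y2) in original-image coordinates."""
    anchors = []
    for y in range(H):
        for x in range(W):
            cx, cy = x * stride + stride / 2, y * stride + stride / 2  # anchor point in the image
            for s in scales:
                for r in ratios:
                    w, h = s * np.sqrt(r), s / np.sqrt(r)  # same area, different aspect ratio
                    anchors.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return np.array(anchors)  # (k*H*W, 4)

print(generate_anchors(2, 3).shape)  # (54, 4) = (9 anchors * 2 * 3 points, 4)
```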
Train RPN
- determine the loss
how to assign ground truth to anchors during training:
some anchors are kept, many are abandoned
- a single ground-truth box may assign positive labels to multiple anchors
We assign a positive label to two kinds of anchors: (i) the anchor/anchors with the highest Intersection-over-Union (IoU) overlap with a ground-truth box, or (ii) an anchor that has an IoU overlap higher than 0.7 with any ground-truth box. Note that a single ground-truth box may assign positive labels to multiple anchors.
Usually the second condition is sufficient to determine the positive samples;
but we still adopt the first condition for the reason that in some rare cases the second condition may find no positive sample. We assign a negative label to a non-positive anchor if its IoU ratio is lower than 0.3 for all ground-truth boxes. Anchors that are neither positive nor negative do not contribute (abandoned) to the training objective.
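A sketch of this labeling rule, reusing the hypothetical `iou` helper from the R-CNN section (thresholds 0.7 and 0.3 are from the paper; -1 marks abandoned anchors):

```python
import numpy as np

def label_anchors(anchors, gt_boxes, pos_thresh=0.7, neg_thresh=0.3):
    """Per-anchor labels: 1 = positive, 0 = negative, -1 = abandoned.
    Reuses iou() from the R-CNN sketch above."""
    overlaps = np.stack([iou(a, gt_boxes) for a in anchors])   # (num_anchors, num_gt)
    labels = np.full(len(anchors), -1, dtype=np.int64)         # default: abandoned

    labels[overlaps.max(axis=1) < neg_thresh] = 0    # IoU < 0.3 with every GT -> negative
    labels[overlaps.max(axis=1) >= pos_thresh] = 1   # rule (ii): IoU >= 0.7 with any GT -> positive
    labels[overlaps.argmax(axis=0)] = 1              # rule (i): best anchor per GT -> positive
    return labels
```

Rule (i) is applied last so that the best-matching anchor for each ground-truth box is positive even when its IoU is below 0.7.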
- a set of \(k\) bounding-box regressors is learned
To account for varying sizes, a set of \(k\) bounding-box regressors are learned.
Each regressor is responsible for one scale and one aspect ratio, and the \(k\) regressors do not share weights.
(these \(k\) regressors are realized as the \(4k\) output channels of the 1x1 reg conv above)
- Loss = cls + reg
  - cls: as in Fast-RCNN
  - reg: as in Fast-RCNN
- sampling
We follow the image-centric sampling strategy from [2] to train this network. Each mini-batch arises from a single image that contains many positive and negative example anchors.
randomly sample 256 anchors in an image to compute the loss function of a mini-batch, where the sampled positive and negative anchors have a ratio of up to 1:1.
If there are fewer than 128 positive samples in an image, we pad the mini-batch with negative ones.
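A sketch of this sampling step (256 anchors and the up-to-1:1 ratio are from the paper; `labels` uses the 1/0/-1 convention from the sketch above):

```python
import numpy as np

def sample_minibatch(labels, batch_size=256, rng=None):
    """Sample up to 128 positives, pad the rest of the 256 with negatives.
    labels: per-anchor array with 1 = positive, 0 = negative, -1 = abandoned."""
    rng = rng or np.random.default_rng()
    pos = np.flatnonzero(labels == 1)
    neg = np.flatnonzero(labels == 0)
    n_pos = min(len(pos), batch_size // 2)        # pos:neg ratio of up to 1:1
    n_neg = min(batch_size - n_pos, len(neg))     # pad with negatives
    return np.concatenate([rng.choice(pos, n_pos, replace=False),
                           rng.choice(neg, n_neg, replace=False)])
```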
- training strategy
refer to §3.2 "Sharing Features for RPN and Fast R-CNN" in the paper:
first train RPN, and use the proposals to train Fast R-CNN.
The network tuned by Fast R-CNN is then used to initialize RPN, and this process is iterated. This is the solution that is used in all experiments in this paper.
Why Faster?
Using RPN yields a much faster detection system than using either SS or EB because of shared convolutional computations; the fewer proposals also reduce the region-wise fully-connected layers’ cost.
i.e., features are extracted from a single-pass CNN feature map.
Why use NMS?
One point has many corresponding prior-guided anchors.
NMS is performed after RPN:
To reduce redundancy, we adopt non-maximum suppression (NMS) on the proposal regions based on their cls scores.
As we will show, NMS does not harm the ultimate detection accuracy, but substantially reduces the number of proposals.
After NMS, we use the top-N ranked proposal regions for detection.
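For reference, a minimal numpy version of greedy NMS (in practice one would call `torchvision.ops.nms`; `iou` is the hypothetical helper from the R-CNN section, and 0.7 is the NMS threshold the paper uses for RPN proposals):

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.7):
    """Greedy NMS: keep the highest-scoring box, drop every remaining box
    that overlaps it by more than iou_thresh, repeat."""
    order = scores.argsort()[::-1]          # indices, best score first
    keep = []
    while len(order) > 0:
        best = order[0]
        keep.append(best)
        if len(order) == 1:
            break
        overlaps = iou(boxes[best], boxes[order[1:]])  # iou() from the R-CNN sketch above
        order = order[1:][overlaps <= iou_thresh]
    return np.array(keep)
```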
See Also
https://whatdhack.medium.com/a-deeper-look-at-how-faster-rcnn-works-84081284e1cd
https://tryolabs.com/blog/2018/01/18/faster-r-cnn-down-the-rabbit-hole-of-modern-object-detection
This article is from cnblogs, by ZXYFrank. When reposting, please cite the original link: https://www.cnblogs.com/zxyfrank/p/16559357.html