Loading

RCNN, Fast-RCNN, Faster-RCNN 图文详解 英文

RCNN, Fast-RCNN, Faster-RCNN

TL;DR

ref

progress

  • RCNN: Two Stage
    • selective search
    • CV wrapping
    • pretrained CNN -> SVM classify
    • SVM
  • Fast-RCNN: One Stage
    • RoI-Pooling
    • Single-stage: single CNN pass -> single feature map
      • Why Fast: share feature, reduce computation
    • cls + reg heads
  • Faster-RCNN: Two Stage & Anchor-based
    • RPN
      • Introcuce Anchor

RCNN

2014

  • Selective Search
  • CNN for every proposed regions
    • resize image
  • SVM for each class
  • NMS

Experiment Details

Is there Pretraining

Yes.

Train CNN on classification dataset

a criminative model

How to construct the training set

  • \(\geq\) 0.5 IOU as positives, others as negatives
  • dilate the tight bounding box so that at the warped size there are exactly p pixels of warped image context around the original box
  • warp all pixels(object + little context) around it to the required size.

How to train one the detection dataset

  • 1000-way classification layer on ILSVRC to 21-way classification for detection(21 classes in detection targets)

We treat all region proposals with \(\geq\) 0:5 IoU overlap with a ground-truth box as positives for that box’s class and the rest as negatives.

How to determine the Classification Threshold?

grid search

0.3 as confidence threshold in the paper.

Why not directly use the classification layer in CNN to preform region discrimination

We also discuss why it’s necessary to train detection classifiers rather than simply use outputs from the final layer (fc8) of the fine-tuned CNN.

TODO 『2022-02-28 [supplementary material]?

Fast RCNN

From SPPNet to RoI Pooling

https://towardsdatascience.com/understanding-sppnet-for-object-detection-and-classification-682d6d2bdfb

why do CNNs require a fixed input size?

In fact, convolutional layers do not require a fixed image size and can generate feature maps of any sizes.

On the other hand, the fully-connected layers need to have fixedsize/ length input by their definition. Hence, the fixedsize constraint comes only from the fully-connected layers, which exist at a deeper stage of the network.

SPP = Concat(maxPool(scale,feature map))

Pipeline

still use selective search

  • ROI Pooling(⭐ Making it fast)
    • reuse conv features
    • end-to-end
  • cls + reg detection heads

Faster RCNN

Contribution

  • RPN
    • predefined bboxes(anchors) with different scales

The rest is inherited from Fast RCNN

RPN

learns the proposals from feature maps

🔍 see also RPN

Pipeline

  • given raw image \(\mathbb{R}^{H_{0}\times W_{0}}\), feed it to a backbone

the images will be down scaled to \(C_{\text{backbone}}\) tiled feature maps \(\mathbf{f} \in \mathbb{R}^{C_{\text{backbone}} \times H\times W }\)

\[H = \frac{H_{0}}{\mathcal{R}} \]

e.g. for vgg16, the image is reduced by \(\mathcal{R} = 16\) times by the end of the backbone

  • for each point \(p = (x,y)\) in \(\mathbf{f}\)

First, generate \(k\) anchors

🔍 see below, what are anchors

Then apply a small convnet of size \(n\times n\) centered at \((x,y)\) on the feature map

About how to adjust the predifined anchor boxes

  1. this convnet of size \(n\times n\) is compared to sliding window in the paper.
  2. this convnet is composed of \(D\) filters, and each filter is of size \(n\times n\)

at each point \(p\), \(D=256\) filters will produce \(256\) numbers as a local linear composion(i.e. what CNN does) of a neighbor \(n\times n\).

So, given tiled feature maps \(\mathbf{f} \in \mathbb{R}^{H\times W}\), we will get \(\mathbf{f}_{rpn} \in \mathbb{R}^{256\times H \times W}\)

  • apply 1x1 conv to refine anchors

from \(\mathbf{f}_{rpn} \in \mathbb{R}^{256\times H \times W}\) to

  1. \(\mathcal{S}_{\text{cls}}\in \mathbb{R}^{2(\text{cls}) \times 9(\text{\#anchor}) \times H \times W}\)

\(cls\) is background or not(2-classification)

  1. \(\mathcal{S}_{\text{reg}}\in \mathbb{R}^{4(\text{x,y,w,h}) \times 9(\text{\#anchor}) \times H \times W}\)

Anchor

At each anchor point, there will be \(k\) predifined boxes in the original image

actually it doesn't matter where it lands.

If anchors are defind and refined with the coords on the feature map, then there should be one more step of back projecting them to the original image.

there are \(H\times W\) anchor points in the feature map, hence \(k \times H \times W\) anchors totally.

And there will be a set of predefined anchors in the original image \(I\) at every stride \(\mathcal{R}\).

Train RPN

  1. determine loss

how assign ground truth to anchors during training

some are kept, a lot are abandoned.

  • a single ground-truth box may assign positive labels to multiple anchors.

We assign a positive label to two kinds of anchors: (i) the anchor/anchors with the highest Intersection-over- Union (IoU) overlap with a ground-truth box, or (ii) an anchor that has an IoU overlap higher than 0.7 with 5 any ground-truth box. Note that a single ground-truth box may assign positive labels to multiple anchors.

Usually the second condition is sufficient to determine the positive samples;
but we still adopt the first condition for the reason that in some rare cases the second condition may find no positive sample.

We assign a negative label to a non-positive anchor if its IoU ratio is lower than 0.3 for all ground-truth boxes. Anchors that are neither positive nor negative do not contribute(abandoned) to the training objective.

  • a set of bounding-box regressors k are learned.

To account for varying sizes, a set of k bounding-box regressors are learned.

Each regressor is responsible for one scale and one aspect ratio, and the k regressors do not share weights

  • Loss = cls + reg
    • as in Fast-RCNN

  1. sampling

We follow the image-centric sampling strategy from [2] to train this network.

Each mini-batch arises from a single image that contains many positive and negative example anchors.

randomly sample 256 anchors in an image to compute the loss function of a mini-batch, where the sampled positive and negative anchors have a ratio of up to 1:1.

If there are fewer than 128 positive samples in an image, we pad the mini-batch with negative ones.

  1. training strategy

refer to 3.2 Sharing Features for RPN and Fast R-CNN in the paper

first train RPN,
and use the proposals to train Fast R-CNN.
The network tuned by Fast R-CNN is then used to initialize RPN, and this process is iterated. This is the solution that is used in all experiments in this paper.

Why Faster?

Using RPN yields a much faster detection system than using either SS or EB because of shared convolutional computations; the fewer proposals also reduce the region-wise fully-connected layers’ cost

extract feature from a one-pass CNN feature map

Why using NMS?

One Point has many corresponding Prior-guided Anchors

NMS is performed after RPN

To reduce redundancy, we adopt non-maximum suppression (NMS) on the proposal regions based on their cls scores.

As we will show, NMS does not harm the ultimate detection accuracy, but substantially reduces the number of proposals.

After NMS, we use the top-N ranked proposal regions for detection.

See Also

https://whatdhack.medium.com/a-deeper-look-at-how-faster-rcnn-works-84081284e1cd

https://tryolabs.com/blog/2018/01/18/faster-r-cnn-down-the-rabbit-hole-of-modern-object-detection

posted @ 2022-08-07 16:47  ZXYFrank  阅读(114)  评论(0编辑  收藏  举报