Local Relation Networks for Image Recognition: Detailed Notes in English

Local Relation Network

Adapt the filter according to appearance affinity

Motivation

Meaningful and adaptive spatial aggregation

Humans have a remarkable ability to “see the infinite world with finite means” [26, 2].

Recognition-by-components: a theory of human image understanding.

W. von Humboldt. On Language: On the Diversity of Human Language Construction and Its Influence on the Mental Development of the Human Species. Cambridge Texts in the History of Philosophy. Cambridge University Press, 1999/1836.

Hierarchical features -> different levels of features

Rather than recognizing how elements can be meaningfully joined together, convolutional layers act as templates

1 filter -> 1 channel
It's a waste of channels.

local relation layer

locality & geometric priors
determine features bottom-up

Convolution Layers and its Evolution

accuracy-efficiency trade-off
- group convolution
- depth-wise convolution
enlarge receptive field
- dilated convolution
- deformable convolution
- active convolution
relax the requirement for sharing weights (weight sharing is too rigid)
- locally connected layers (DeepFace)
Capsule Networks
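
To make the trade-offs in the list above concrete, here is a small sketch that counts convolution weights and the spatial extent of a dilated kernel. The helper names `conv_params` and `effective_kernel` are my own, and the channel count (24) and kernel size (7) are simply borrowed from the running example in these notes, not from any experiment in the paper.

```python
def conv_params(c_in, c_out, k, groups=1):
    """Number of weights (no bias) in a k x k convolution with `groups` groups."""
    assert c_in % groups == 0 and c_out % groups == 0
    return (c_in // groups) * c_out * k * k


def effective_kernel(k, dilation):
    """Spatial extent covered by a k x k kernel with the given dilation rate."""
    return dilation * (k - 1) + 1


# Illustrative numbers only: 24 input and output channels, 7x7 kernel.
standard = conv_params(24, 24, 7)              # every filter sees all channels
grouped = conv_params(24, 24, 7, groups=3)     # group convolution
depthwise = conv_params(24, 24, 7, groups=24)  # depth-wise: one channel per filter

print(standard, grouped, depthwise)
print(effective_kernel(3, 2))  # a 3x3 kernel with dilation 2 spans 5 pixels
```

Group and depth-wise convolution cut the weight count by the number of groups, while dilation enlarges the covered area without adding weights. In both cases the weights are still fixed after training, which is the rigidity the local relation layer tries to relax.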

self-enhancement

filter bubble

Given that we prefer to eschew negative experiences, it comes as no surprise that people avoid the immediate psychological discomfort from cognitive dissonance by simply not reading or listening to differing opinions.

Self-Attention / Graph Neural Networks
- for long-range context

This work

a new feature extractor
introducing compositionality directly into the representation

Some concepts

bottom-up & top-down aggregation

geometric prior
locality

Algorithm

Local Relation Networks are abbreviated as LR-Nets

Suppose

\(C = 24, m = 8, k = 7, C/m = 3\)

We observe no accuracy drop with up to 8 channels (default) sharing the same aggregation weights (i.e. m = 8)

\(H = 160, W = 160\)

In this architecture, the receptive field is tied to the concept of the Geometry Prior.
More precisely, the learned Geometry Prior is defined over a neighborhood (similar to a receptive field).

k is the neighborhood size

The Geometry Prior is analogous to a conventional convolution filter.
However, the geometry prior is considered together with appearance composability, which makes the aggregation adapt to the input.
In other words, the geometry prior is modulated by the correlation between input pixels.

Input Feature Map 24x160x160
- 1x1 conv
  - Query: 3x160x160 (compress #channels from 24 to 3)
    - 160x160 points; every point has a query in C/m = 3 channels.
  - Key: 3x160x160
    - with kernel size k = 7, every pixel has its own 7x7 region in the key maps
- 1x1 conv
  - Geometry Prior: 3x7x7
    - for every region/neighbor
\(W_{\text{neighbor}} = \text{SoftMax}(\text{Geo.}+\text{App.})\) (Geometry and Appearance)
\(\text{pixel}_{x,y} = W_{\text{neighbor}}\,\text{Input}_{\text{neighbor}}\)
- \(\text{neighbor}\) is the k x k window centered at (x, y) (the source and target pixel position).
All of the aggregation is performed within a receptive field of k x k.
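
The steps above can be sketched in NumPy. This is a minimal sketch under my own assumptions, not the authors' implementation: the appearance term is taken to be the product of the scalar query and key, every m consecutive channels share one aggregation map, borders are zero-padded, and the function names `neighborhoods` and `local_relation_layer` are hypothetical. A 16x16 map is used instead of 160x160 only to keep the example light.

```python
import numpy as np


def neighborhoods(fmap, k):
    """All k x k windows of a (C, H, W) map, zero-padded: shape (C, H, W, k, k)."""
    pad = k // 2
    padded = np.pad(fmap, ((0, 0), (pad, pad), (pad, pad)))
    return np.lib.stride_tricks.sliding_window_view(padded, (k, k), axis=(1, 2))


def local_relation_layer(x, w_query, w_key, geo_prior, k=7, m=8):
    """x: (C, H, W); w_query, w_key: (C/m, C); geo_prior: (C/m, k, k)."""
    C, H, W = x.shape
    G = C // m  # number of aggregation maps, C/m

    # 1x1 convolutions are per-pixel channel mixing.
    query = np.einsum("gc,chw->ghw", w_query, x)  # (G, H, W)
    key = np.einsum("gc,chw->ghw", w_key, x)      # (G, H, W)

    # Appearance term (assumption: scalar product of query and neighboring keys).
    appearance = query[..., None, None] * neighborhoods(key, k)  # (G, H, W, k, k)

    # W = SoftMax(Geo + App), normalized over the k*k neighbors of each pixel.
    logits = (appearance + geo_prior[:, None, None]).reshape(G, H, W, k * k)
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights = (weights / weights.sum(axis=-1, keepdims=True)).reshape(G, H, W, k, k)

    # Aggregate: each group of m consecutive channels shares one weight map
    # (the grouping order is an assumption).
    x_nb = neighborhoods(x, k).reshape(G, m, H, W, k, k)
    out = np.einsum("ghwij,gmhwij->gmhw", weights, x_nb)
    return out.reshape(C, H, W), weights


rng = np.random.default_rng(0)
x = rng.normal(size=(24, 16, 16))
w_q = 0.1 * rng.normal(size=(3, 24))
w_k = 0.1 * rng.normal(size=(3, 24))
geo = rng.normal(size=(3, 7, 7))
out, weights = local_relation_layer(x, w_q, w_k, geo, k=7, m=8)
print(out.shape, weights.shape)
```

Note that the output has the same shape as the input, and that the aggregation weights differ at every pixel, unlike a convolution filter. With all-zero query, key and geometry prior the layer degenerates to a plain 7x7 average.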

Design and Analysis

\[W = \text{SoftMax}(\text{GeoPrior}+\text{AppearanceComposability}) \]

Locality

They claim that the LR (Local Relation) layer can utilize large kernels more effectively.

This difference may be due to the representation power of convolution layer being bottlenecked by the number of fixed filters, hence there is no benefit from a larger kernel size.

Weight sharing across different positions in an image limits the utilization of the representation power of large kernels.

Appearance composability

While in previous works the query and key are vectors, in the local relation layer, we use scalars to represent them so that the computation and representation are lightweight.
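
A rough count shows why scalars keep this lightweight. The sketch below counts the multiplications needed to score every pixel against its k x k neighbors, using the sizes from the Algorithm section (C/m = 3, H = W = 160, k = 7). The function name `appearance_cost` and the vector dimension 8 are hypothetical, chosen only for comparison.

```python
def appearance_cost(groups, height, width, k, dim=1):
    """Multiplications needed to score each pixel against its k x k neighbors.

    dim=1 is the scalar query/key case; dim>1 is a hypothetical vector case
    in which each score is a dim-dimensional dot product.
    """
    return groups * height * width * k * k * dim


scalar_cost = appearance_cost(3, 160, 160, 7)         # scalar query/key
vector_cost = appearance_cost(3, 160, 160, 7, dim=8)  # hypothetical 8-d vectors

print(scalar_cost, vector_cost)
```

The cost grows linearly with the query/key dimension, so scalars are the cheapest choice for a fixed neighborhood size.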

Geometric Prior

What is that? 😂
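
One possible reading, as far as I understand the paper: the geometry prior is a learned function of the relative position between the center pixel and each neighbor, produced by a small network of two channel transformations with a ReLU in between. The sketch below follows that reading. The hidden width, the random weights and the names `relative_positions` and `geometry_prior` are my assumptions.

```python
import numpy as np


def relative_positions(k):
    """Offsets (dy, dx) of every neighbor relative to the window center: (2, k, k)."""
    r = np.arange(k) - k // 2
    dy, dx = np.meshgrid(r, r, indexing="ij")
    return np.stack([dy, dx]).astype(float)


def geometry_prior(w1, w2, k=7):
    """Two channel transformations with a ReLU in between, applied per offset.

    w1: (hidden, 2) and w2: (C/m, hidden). Returns a (C/m, k, k) prior that
    depends only on relative position, not on the image content.
    """
    pos = relative_positions(k)                                   # (2, k, k)
    hidden = np.maximum(np.einsum("hc,cij->hij", w1, pos), 0.0)   # ReLU
    return np.einsum("gh,hij->gij", w2, hidden)                   # (C/m, k, k)


rng = np.random.default_rng(0)
w1 = rng.normal(size=(16, 2))  # hidden width 16 is an arbitrary choice
w2 = rng.normal(size=(3, 16))
prior = geometry_prior(w1, w2, k=7)
print(prior.shape)
```

Because the prior depends only on the offsets, it is shared across all positions like a convolution filter. It is the appearance term added before the SoftMax that makes the final weights input-dependent.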

Questions

We always use image classification as the task to test architectures.

I think classification may not need very detailed/fine-grained feature information.

Hence, it cannot measure the capability of extracting highly representative features.

Posted 2022-03-19 20:52 by ZXYFrank
