[Paper Notes] A Benchmark on Tricks for Large-scale Image Retrieval

2. Pre-processing Tricks

2.1. Dataset Cleaning for Training

TR1

We noticed by visual inspection that the training set of GLD v1 is clean and reliable, so we used the dataset as is for training. To obtain a semi-supervised learning effect [4, 23], we added virtual classes to the training set. These virtual classes are the clusters from the test and index sets of GLD v1. First, we trained a baseline model using GLD v1 and extracted features of the test and index sets of GLD v1. Then, DBSCAN was applied to generate clusters, where each cluster was assigned as a new virtual class. We call the result TR1 for clarity.
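The step of turning clusters into virtual classes can be sketched as follows. This is only an illustration under assumptions: the function name `assign_virtual_classes`, the choice to number virtual classes directly after the real ones, and the handling of noise points (label -1, the usual DBSCAN convention) are not taken from the paper.

```python
import numpy as np

def assign_virtual_classes(cluster_labels, num_real_classes):
    """Map cluster ids to new class ids placed after the real classes.

    Assumption: cluster_labels follows the DBSCAN convention where -1
    marks noise points; those points get no virtual class (kept as -1).
    """
    cluster_labels = np.asarray(cluster_labels)
    virtual = np.full(cluster_labels.shape, -1, dtype=int)
    kept = cluster_labels >= 0
    # Re-index the surviving cluster ids to 0..C-1, then shift past the real classes.
    unique_ids, inverse = np.unique(cluster_labels[kept], return_inverse=True)
    virtual[kept] = num_real_classes + inverse
    return virtual, len(unique_ids)

# Toy cluster labels for seven unlabeled images (hypothetical values).
labels = [0, 0, -1, 3, 3, 3, 7]
virtual, n_new = assign_virtual_classes(labels, num_real_classes=100)
```

The returned ids can then be appended to the labeled training set, so that the classifier head grows by `n_new` outputs.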

TR2

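The index set of GLD v2 contains many non-landmark images. To train a model that better separates landmarks from non-landmarks, the authors built TR2.

As with TR1, features are clustered after training; the authors then "picked several distractor clusters as virtual classes" and merged these distractor classes with the virtual classes of TR1 to form TR2.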

TR3

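The training set of GLD v2 has more classes and images than GLD v1 does, but it is also noisier. The authors first remove natural-scene images with a binary classifier, then cluster the images within each class; when a class contains multiple clusters, only the largest cluster is kept.

(Were classes that duplicate those in TR2 also excluded?)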
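The "keep only the largest cluster per class" rule can be sketched as below. It assumes per-image class ids and per-class cluster ids are already available; the function name `keep_largest_cluster` and the decision to drop noise points (cluster id -1) are assumptions for illustration, not details given in the notes.

```python
import numpy as np

def keep_largest_cluster(class_ids, cluster_ids):
    """Return a boolean mask keeping, per class, only its largest cluster.

    Assumption: cluster id -1 marks noise and is always dropped.
    """
    class_ids = np.asarray(class_ids)
    cluster_ids = np.asarray(cluster_ids)
    keep = np.zeros(len(class_ids), dtype=bool)
    for c in np.unique(class_ids):
        in_class = class_ids == c
        valid = cluster_ids[in_class & (cluster_ids >= 0)]
        if valid.size == 0:
            continue  # the class contains only noise, so nothing is kept
        ids, counts = np.unique(valid, return_counts=True)
        largest = ids[np.argmax(counts)]
        keep |= in_class & (cluster_ids == largest)
    return keep

# Toy example: class 0 has two clusters plus one noise image, class 1 has one cluster.
class_ids = [0, 0, 0, 0, 1, 1, 1]
cluster_ids = [0, 0, 1, -1, 2, 2, 2]
mask = keep_largest_cluster(class_ids, cluster_ids)
```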

2.2. Small-scale Validation

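2% of the training set is sampled to form the test set and the index set. A virtual class from a noise cluster is then added ("We included a virtual class from a noise cluster"): because GLD v2 itself is noisy, this makes the validation distribution match the dataset more closely.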

2.3. Experimental Results

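The authors' goal is to filter out partial-view photos of landmarks such as those shown in the figure below.

Intra-class problem in the raw dataset: the raw dataset includes images taken from inside, outside, and even partial viewpoints from within the same landmark. These kinds of datasets with large intra-class variation may interfere with learning proper representations in the model, especially when a pair-wise ranking loss is used.

Inter-class problem in the raw dataset: images of nature scenes also make the training process hard as they have little inter-class variation.

GLD v2 is too noisy to be used directly for training. "Valid" is the authors' own validation set; the other two metrics are presumably the scores returned by the challenge website after the authors submitted their model.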

Training with TR1, which contains virtual classes from the test and index sets, improves the model performance by using unlabeled data when the original training set is not helpful anymore.

The model trained with TR2 gives performance similar to the model trained with TR1 because the number of data and classes is not noticeably different.

Using TR3, which includes a cleaned training set from GLD v2, further improved the performance.

3. Learning Representations

3.1. Pooling

3.2. Objectives

Xent + Triplet

Triplet loss + classification loss such as cross-entropy (Xent) loss
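A minimal NumPy sketch of a combined objective of this kind is shown below. The Euclidean distance, the margin value, and the weighting factor `weight` are assumptions chosen for illustration; the notes do not state how the authors balance the two terms.

```python
import numpy as np

def cross_entropy(logits, labels):
    # Numerically stable log-softmax followed by the mean negative log-likelihood.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def triplet_loss(anchor, positive, negative, margin=0.1):
    # Hinge on the gap between anchor-positive and anchor-negative distances.
    d_pos = np.linalg.norm(anchor - positive, axis=1)
    d_neg = np.linalg.norm(anchor - negative, axis=1)
    return np.maximum(d_pos - d_neg + margin, 0.0).mean()

def xent_triplet(logits, labels, anchor, positive, negative, weight=1.0, margin=0.1):
    # Assumed combination: plain sum with a scalar weight on the triplet term.
    return cross_entropy(logits, labels) + weight * triplet_loss(
        anchor, positive, negative, margin
    )

# Toy batch: uniform logits over 3 classes and one easy triplet.
logits = np.zeros((2, 3))
labels = np.array([0, 1])
anchor = np.array([[0.0, 0.0]])
positive = np.array([[0.0, 0.0]])
negative = np.array([[1.0, 0.0]])
loss = xent_triplet(logits, labels, anchor, positive, negative)
```

In this toy batch the triplet is already satisfied by more than the margin, so only the cross-entropy term contributes.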

N-pair + Angular

See also:

https://blog.csdn.net/qq_16234613/article/details/81210320

https://www.jianshu.com/p/a14b889a7aab

3.3. Training a Single Model

3.4. Experimental Results

Pooling

MAC performed the worst.

SPoC showed the best performance for N-pair + Angular models

GeM was the best pooling method for Xent + Triplet models.
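For reference, the three pooling operators compared here can be written in a few lines. This sketch assumes a single feature map of shape (C, H, W); the default exponent `p=3.0` and the clipping constant `eps` are illustrative assumptions rather than the paper's settings. SPoC is written as a spatial average, which differs from sum pooling only by a constant factor.

```python
import numpy as np

def spoc(fmap):
    # SPoC: average (sum) pooling over the spatial dimensions.
    return fmap.mean(axis=(1, 2))

def mac(fmap):
    # MAC: maximum activation per channel.
    return fmap.max(axis=(1, 2))

def gem(fmap, p=3.0, eps=1e-6):
    # GeM: generalized mean; p = 1 gives SPoC, large p approaches MAC.
    clipped = np.clip(fmap, eps, None)
    return (clipped ** p).mean(axis=(1, 2)) ** (1.0 / p)

# One channel with a 2x2 spatial grid.
fmap = np.array([[[1.0, 2.0], [3.0, 4.0]]])
pooled_spoc = spoc(fmap)
pooled_gem = gem(fmap)
pooled_mac = mac(fmap)
```

GeM interpolates between the other two, which is one way to read the results: the best exponent depends on the objective the backbone was trained with.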

Objectives

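With TR1, Xent + Triplet performs best. With TR3, the Xent + Triplet model cannot be trained well because the Xent loss fluctuates. This may be because the Xent loss is sensitive to dataset quality, since the refinement process may leave a small number of duplicated classes.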

Input size

It shows that performance rises as the input size increases as larger input sizes generate bigger feature maps in the convolutional neural network (CNN) models, which thus contain richer information. However, the performance does not keep increasing indefinitely, starting to decrease at a certain point.

4. Post-processing Tricks

feature ensemble, database augmentation (DBA), query expansion (QE), and reranking

4.1. Multiple Feature Ensemble

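The feature-ensemble section examines which features actually improve the model when combined, and whether combining the best-performing individual features yields a real gain at inference time.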
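One common way to combine descriptors from several models is concatenation, which is also what the final submission described later in these notes uses. The sketch below is an assumption-laden illustration: the l2-normalization before and after concatenation and the helper names are not taken from the paper.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to unit Euclidean length.
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

def concat_ensemble(feature_sets):
    # Assumption: normalize each model's descriptors so no single model
    # dominates, concatenate them, then normalize the result again.
    normalized = [l2_normalize(f) for f in feature_sets]
    return l2_normalize(np.concatenate(normalized, axis=1))

# Descriptors for two images from two hypothetical models.
feats_a = np.array([[1.0, 2.0, 2.0], [0.0, 3.0, 4.0]])
feats_b = np.array([[3.0, 4.0], [1.0, 0.0]])
ensemble = concat_ensemble([feats_a, feats_b])
```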

4.2. DBA and QE

DBA (database-side augmentation)

Every feature in the database is replaced with a weighted sum of the point’s own value and those of its top k nearest neighbors (k-NN).

In other words, each database feature is replaced by a fusion of itself and its top-k neighbors.
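A minimal sketch of DBA is given below. The weighting scheme is an assumption: each descriptor keeps weight 1 and every neighbor is weighted by its (non-negative) cosine similarity. The paper may use different weights, and the function names are illustrative.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

def dba(db, k=1):
    """Replace every database descriptor by a weighted sum of itself and its k-NN."""
    db = l2_normalize(db)
    sims = db @ db.T
    np.fill_diagonal(sims, -np.inf)            # exclude each point from its own neighbors
    nn_idx = np.argsort(-sims, axis=1)[:, :k]  # indices of the k most similar descriptors
    nn_sims = np.take_along_axis(sims, nn_idx, axis=1)
    weights = np.clip(nn_sims, 0.0, None)      # assumed weights: cosine similarity
    augmented = db + (weights[:, :, None] * db[nn_idx]).sum(axis=1)
    return l2_normalize(augmented)

# Three toy 2-D descriptors; requires k < number of descriptors.
db = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
db_aug = dba(db, k=1)
```

After augmentation the first descriptor is pulled slightly toward its neighbor, which is the intended smoothing effect.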

QE (query expansion)

It retrieves top k nearest neighbors from the database for each query and combines the retrieved neighbors with the original query.

This process is repeated as many times as necessary, and the final combined query is used to produce the ranked list of retrieved images.

In other words, the top-k results are fused with the original query, and the search is run a second time (or more).
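The query-expansion loop can be sketched as follows. This is the simplest variant, with uniform weights on the retrieved neighbors; the uniform weighting and the function names are assumptions, not the paper's exact formulation.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

def query_expansion(queries, db, k=1, iterations=1):
    """Fuse each query with its top-k database neighbors, then search again."""
    q = l2_normalize(queries)
    db = l2_normalize(db)
    for _ in range(iterations):
        sims = q @ db.T
        nn_idx = np.argsort(-sims, axis=1)[:, :k]
        # Assumed fusion: unweighted sum of the query and its neighbors.
        q = l2_normalize(q + db[nn_idx].sum(axis=1))
    ranks = np.argsort(-(q @ db.T), axis=1)  # final ranked list per query
    return q, ranks

# One toy query against three database descriptors.
queries = np.array([[1.0, 0.0]])
db = np.array([[0.8, 0.6], [0.0, 1.0], [1.0, 0.1]])
expanded, ranks = query_expansion(queries, db, k=1, iterations=1)
```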

4.3. PCA whitening

Typically, whitening is learned by a generative model in an unsupervised manner via principal component analysis (PCA) on an independent dataset.

We performed PCA whitening (PCAw) with 4096-dimensional features from DBA and QE to produce 1024-dimensional features by using the implementation in the Scikit-learn API [27]; then, we applied l2-normalization again.
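The quoted passage uses the scikit-learn implementation; the sketch below instead spells out the same idea in plain NumPy so that each step is visible. The function names, the small `eps`, and the toy dimensions (8 reduced to 4 rather than 4096 reduced to 1024) are assumptions for illustration.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

def fit_pca_whitening(x, out_dim, eps=1e-8):
    """Learn the mean, principal directions, and per-direction standard deviations."""
    mean = x.mean(axis=0)
    # Rows of vt are the principal directions of the centered data.
    _, s, vt = np.linalg.svd(x - mean, full_matrices=False)
    components = vt[:out_dim]
    scale = s[:out_dim] / np.sqrt(len(x) - 1) + eps
    return mean, components, scale

def whiten(x, mean, components, scale):
    # Project onto the principal directions and rescale to unit variance.
    return (x - mean) @ components.T / scale

def apply_pca_whitening(x, mean, components, scale):
    # Whitening followed by the second l2-normalization mentioned above.
    return l2_normalize(whiten(x, mean, components, scale))

rng = np.random.default_rng(0)
features = rng.normal(size=(200, 8)) @ rng.normal(size=(8, 8))
mean, components, scale = fit_pca_whitening(features, out_dim=4)
whitened = whiten(features, mean, components, scale)
reduced = apply_pca_whitening(features, mean, components, scale)
```

On the data used for fitting, the whitened projection has an identity covariance matrix, which is a quick way to check that the transform is correct.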

4.4. Reranking

Reranking has a smaller effect than the tricks discussed above, because it only helps when the ground-truth image is already among the top-k candidates.

graph search based on the global descriptor

local matching based on the local descriptor

Graph search

A graph is first built offline over the dataset, and retrieval is then performed by searching that graph.

This method improves the retrieval of small objects and cluttered scenes in particular.

References:

Efficient image retrieval via decoupling diffusion into online and offline processing

Efficient diffusion on region manifolds: Recovering small objects with compact cnn representations

Local matching

This is local matching in the style of SIFT features, generally applied only to the top-k results of global retrieval in order to rerank them.

Because performing geometric verification on all possible pairs of query and index images is expensive for large-scale data, we applied local matching to only the top kd candidates, which is the retrieved result for the global descriptor

We used DELF [26] pre-trained on a landmark dataset, and 1K local features were extracted from each image as described in the paper.

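References:

Large-scale image retrieval with attentive deep local features

The authors found that local matching sometimes produces multiple matched points between two completely unrelated images, so they apply local matching only to candidates whose matching score exceeds a certain threshold.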

4.5. Experimental Results

4.5.1 Multiple Feature Ensemble

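I did not fully understand the analysis in this part.

In Figure 4 (d), the result shows that there is a correlation between the low performance variation of the models and better performance (red line).

Because the variance is large, the "best only" strategy is not recommended; in other words, several combinations should be tried.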

4.5.2 DBA/QE and PCAw

Iterative DBA and QE perform augmentation k times, but we used k = 1 for both DBA and QE as it performed the best.

4.5.3 Reranking

The concept of detect-to-retrieve [37] (D2R) has been proposed, and we used the landmark detector of [37] to detect and crop the region of interest. The cropped region is used for local feature extraction with DELF [26], which is listed as DELF Rerank+D2R in Table 5.

According to Table 5, introducing DELF even hurts the results.

As shown in Figure 4 (e), DFS reranking with DBA/QE improved the performance by the hyper-parameter kq, while DFS reranking without DBA/QE was not helpful.

Furthermore, we conducted experiments by combining DFS with spatial verification (SV), which is denoted as DFS+SV. DFS+SV replaces a pairwise similarity measure of the cosine similarity between the global descriptors with the spatial matching score of the local descriptors obtained by DELF reranking.

That is, the authors replace the global cosine similarity used in DFS with the local matching score; this variant is denoted DFS+SV.

4.5.4 Google Landmark Retrieval Challenge

For the final submission, we chose combination J from Table 4 and applied DBA and QE with k = 1 on each feature. The features were then concatenated and PCAw was applied with an output dimensionality of 1024. Finally, the top 100 candidates were reranked using DFS+SV (NN=20K), which gave higher performance than using DFS only at this time.

posted @ 2021-10-23 16:20 叠加态的猫
