Momentum Contrast (MoCo) for Unsupervised Visual Representation Learning
1 Introduction
1.1 Instance discrimination
Instance discrimination defines the rule for partitioning samples into positives and negatives:
Given a dataset of \(N\) images, randomly select one image \(x_1\). Applying two different Data Transformations to it yields a positive pair; the remaining images \(x_2, x_3, ..., x_N\) are negative samples. All samples are passed through an Encoder (also called a Feature Extractor) to obtain their features.
- \(x_i\): the \(i\)-th image
- \(T_j\): a Data Transformation (Data Augmentation)
- \(x_1^{(1)}\): the anchor (reference point)
- \(x_1^{(2)}\): the positive sample for the anchor
- \(x_2, x_3, ..., x_N\): negative samples for the anchor
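As a toy sketch of this partitioning rule (the `augment` function and the list-based "images" below are illustrative stand-ins for a real augmentation pipeline and dataset, not part of MoCo itself):

```python
import random

def augment(image, seed):
    # Toy stand-in for a data transformation T_j: deterministic per-pixel
    # jitter keyed by the seed (real pipelines use crops, flips, color jitter).
    rng = random.Random(seed)
    return [px + rng.uniform(-0.1, 0.1) for px in image]

def build_instance_discrimination_batch(dataset, anchor_idx):
    """Two views of the anchor form the positive pair; all other images are negatives."""
    anchor = dataset[anchor_idx]
    view1 = augment(anchor, seed=1)  # x_1^(1): the anchor view
    view2 = augment(anchor, seed=2)  # x_1^(2): the positive view
    negatives = [img for i, img in enumerate(dataset) if i != anchor_idx]
    return view1, view2, negatives

# Three 2-pixel "images"; image 0 is the anchor, the other two are negatives.
dataset = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
v1, v2, negs = build_instance_discrimination_batch(dataset, anchor_idx=0)
```

Each of the views and the negatives would then be fed through the encoder to get features.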
1.2 Momentum
Mathematically, momentum can be understood as an exponential moving average (EMA):
\(Y_t = m \cdot Y_{t-1} + (1 - m) \cdot X_t\)
\(m\) is the momentum coefficient; its purpose is to keep \(Y_t\) from depending entirely on the current input \(X_t\)
- \(Y_t\): output at the current step
- \(Y_{t-1}\): output at the previous step
- \(X_t\): input at the current step
The larger \(m\) is, the less the output depends on the current input \(X_t\)
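A minimal numeric sketch of this EMA update (values chosen purely for illustration):

```python
def ema_update(y_prev, x_t, m):
    # Y_t = m * Y_{t-1} + (1 - m) * X_t
    return m * y_prev + (1 - m) * x_t

# With a small m the output tracks the new input; with a large m it barely moves.
y_small_m = ema_update(0.0, 1.0, m=0.1)    # leans heavily on X_t
y_large_m = ema_update(0.0, 1.0, m=0.999)  # stays close to Y_{t-1}
```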
1.3 Momentum Contrast (MoCo)
2 Related Work
This section reviews the relationship between Supervised Learning and Unsupervised Learning/Self-supervised Learning.
2.1 Loss Function
2.2 Pretext tasks
3 Method
3.1 InfoNCE Loss
CrossEntropyLoss
In the \(softmax\) formula, the probability that the model assigns to the true sample (class) is:
\(p_+ = \frac{\exp(z_+)}{\sum_{j=1}^{K} \exp(z_j)}\)
In Supervised Learning, the Ground Truth is a one-hot vector (for instance \([0,1,0,0]\), where \(K=4\)), so the cross-entropy loss reduces to:
\(\mathcal{L}_{CE} = -\log \frac{\exp(z_+)}{\sum_{j=1}^{K} \exp(z_j)}\)
- \(K\) is num_labels (the number of classes)
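A small self-contained sketch of this cross-entropy computation in plain Python (no framework; `logits` and `true_idx` are illustrative names):

```python
import math

def cross_entropy(logits, true_idx):
    # softmax probability of the true class, then the negative log
    denom = sum(math.exp(z) for z in logits)
    p_true = math.exp(logits[true_idx]) / denom
    return -math.log(p_true)

# One-hot target [0, 1, 0, 0] (K = 4) means the true class index is 1.
loss = cross_entropy([2.0, 0.5, 1.0, 0.1], true_idx=1)
```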
InfoNCE Loss
Why, then, can't CrossEntropyLoss be used directly as the loss function for Contrastive Learning? Because in Contrastive Learning \(K\) is enormous (under instance discrimination, ImageNet's 1.28 million images correspond to 1.28 million classes); a softmax over that many classes is intractable, and the exponential operation over such a high-dimensional output is computationally very expensive.
\(\quad\) Contrastive Learning can be thought of as training an Encoder for a dictionary look-up task. Consider an encoded query \(q\) and a set of encoded samples {\(k_0, k_1, k_2, ...\)} that are the keys of a dictionary. Assume that there is a single key (denoted as \(k_+\)) in the dictionary that \(q\) matches.
\(\quad\)A contrastive loss is a function whose value is low when \(q\) is similar to its positive key \(k_+\) and dissimilar to all other keys (considered negative keys for \(q\)). With similarity measured by dot product, one form of contrastive loss function, called InfoNCE, is:
\(\mathcal{L}_q = -\log \frac{\exp(q \cdot k_+ / \tau)}{\sum_{i=0}^{K} \exp(q \cdot k_i / \tau)}\)
- \(q\) is \(feature_{anchor}\)
- \(k_i\ (i=0,\dots,K)\) are 1 \(feature_{positive}\) plus \(K\) \(feature_{negative}\)
- \(K\) is the number of negative samples after negative sampling
- \(\tau\) means temperature, which is a hyper-parameter
- The sum is over one positive and \(K\) negative samples. Intuitively, this loss is the log loss of a (\(K+1\))-way softmax-based classifier that tries to classify \(q\) as \(k_+\).
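Under these definitions, InfoNCE can be sketched in a few lines of plain Python (dot-product similarity over one positive at index 0 and \(K\) negatives; function and variable names are illustrative):

```python
import math

def info_nce(q, keys, pos_idx=0, tau=0.07):
    # logits l_i = (q . k_i) / tau over all (K+1) keys
    logits = [sum(qa * ka for qa, ka in zip(q, k)) / tau for k in keys]
    mx = max(logits)  # subtract max for numerical stability
    denom = sum(math.exp(l - mx) for l in logits)
    # log loss of a (K+1)-way softmax classifier that picks k_+
    return -math.log(math.exp(logits[pos_idx] - mx) / denom)

# q aligned with k_0 (the positive key) and orthogonal to the negative key
loss = info_nce([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], tau=1.0)
```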
In general, the query representation is \(q = f_q(x^{query})\) where \(f_q\) is an encoder network and \(x^{query}\) is a query sample (likewise, \(k = f_k(x^{key})\)). Their instantiations depend on the specific pretext task. The inputs \(x^{query}\) and \(x^{key}\) can be images, patches, or context consisting of a set of patches. The networks \(f_q\) and \(f_k\) can be identical, partially shared, or different.
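In MoCo the key encoder \(f_k\) is not updated by back-propagation; its parameters are a momentum (EMA) average of the query encoder's: \(\theta_k \leftarrow m \cdot \theta_k + (1-m) \cdot \theta_q\). A parameter-list sketch of that update (real implementations apply this element-wise to framework tensors):

```python
def momentum_update(theta_q, theta_k, m=0.999):
    # theta_k <- m * theta_k + (1 - m) * theta_q, applied parameter-wise;
    # only theta_q receives gradients, theta_k drifts slowly behind it.
    return [m * tk + (1 - m) * tq for tq, tk in zip(theta_q, theta_k)]

theta_k = momentum_update([1.0, 2.0], [0.0, 0.0], m=0.9)
```

With \(m\) close to 1 the key encoder evolves smoothly, which keeps the dictionary of keys consistent across iterations.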