Momentum Contrast (MoCo) for Unsupervised Visual Representation Learning
1 Introduction
1.1 Instance discrimination
Instance discrimination defines the rule for partitioning samples into positives and negatives:
Given a dataset of \(N\) images, randomly select one image \(x_1\). Applying two different Data Transformations to it yields a positive pair; the remaining images \(x_2, x_3, ..., x_N\) are negative samples. All samples are passed through an Encoder (also called a Feature Extractor) to obtain their features.
- \(x_i\): the \(i\)-th image
- \(T_j\): a Data Transformation (Data Augmentation)
- \(x_1^{(1)}\): the anchor (reference point)
- \(x_1^{(2)}\): the positive sample for the anchor
- \(x_2, x_3, ..., x_N\): negative samples for the anchor
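As a toy sketch of this partitioning rule (the `augment` function and the list-based "images" below are illustrative stand-ins for a real augmentation pipeline and dataset, not part of MoCo itself):

```python
import random

def augment(image, seed):
    # Toy stand-in for a data transformation T_j: deterministic per-pixel
    # jitter keyed by the seed (real pipelines use crops, flips, color jitter).
    rng = random.Random(seed)
    return [px + rng.uniform(-0.1, 0.1) for px in image]

def build_instance_discrimination_batch(dataset, anchor_idx):
    """Two views of the anchor form the positive pair; all other images are negatives."""
    anchor = dataset[anchor_idx]
    view1 = augment(anchor, seed=1)  # x_1^(1): the anchor view
    view2 = augment(anchor, seed=2)  # x_1^(2): the positive view
    negatives = [img for i, img in enumerate(dataset) if i != anchor_idx]
    return view1, view2, negatives

# Three 2-pixel "images"; image 0 is the anchor, the other two are negatives.
dataset = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
v1, v2, negs = build_instance_discrimination_batch(dataset, anchor_idx=0)
```

Each of the views and the negatives would then be fed through the encoder to get features.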
1.2 Momentum
Mathematically, momentum can be understood as an exponential moving average (EMA):
\(Y_t = m \cdot Y_{t-1} + (1 - m) \cdot X_t\)
\(m\) is the momentum coefficient; its purpose is to keep \(Y_t\) from depending entirely on the current input \(X_t\)
- \(Y_t\): output at the current step
- \(Y_{t-1}\): output at the previous step
- \(X_t\): input at the current step
The larger \(m\) is, the less the output depends on the current input \(X_t\)
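A minimal numeric sketch of this EMA update (values chosen purely for illustration):

```python
def ema_update(y_prev, x_t, m):
    # Y_t = m * Y_{t-1} + (1 - m) * X_t
    return m * y_prev + (1 - m) * x_t

# With a small m the output tracks the new input; with a large m it barely moves.
y_small_m = ema_update(0.0, 1.0, m=0.1)    # leans heavily on X_t
y_large_m = ema_update(0.0, 1.0, m=0.999)  # stays close to Y_{t-1}
```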
1.3 Momentum Contrast (MoCo)
2 Related Work
This section reviews the relationship between Supervised Learning and Unsupervised Learning/Self-supervised Learning.
2.1 Loss Function
2.2 Pretext tasks
3 Method
3.1 InfoNCE Loss
CrossEntropyLoss
In the \(softmax\) formula, the probability that the model assigns to the true sample (class) is:
\(p_+ = \frac{\exp(z_+)}{\sum_{j=1}^{K} \exp(z_j)}\)
In Supervised Learning, the Ground Truth is a one-hot vector (for instance \([0,1,0,0]\), where \(K=4\)), so the cross-entropy loss reduces to:
\(\mathcal{L}_{CE} = -\log \frac{\exp(z_+)}{\sum_{j=1}^{K} \exp(z_j)}\)
- \(K\) is num_labels (the number of classes)
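A small self-contained sketch of this cross-entropy computation in plain Python (no framework; `logits` and `true_idx` are illustrative names):

```python
import math

def cross_entropy(logits, true_idx):
    # softmax probability of the true class, then the negative log
    denom = sum(math.exp(z) for z in logits)
    p_true = math.exp(logits[true_idx]) / denom
    return -math.log(p_true)

# One-hot target [0, 1, 0, 0] (K = 4) means the true class index is 1.
loss = cross_entropy([2.0, 0.5, 1.0, 0.1], true_idx=1)
```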
InfoNCE Loss
Why, then, can't CrossEntropyLoss be used directly as the loss function for Contrastive Learning? Because in Contrastive Learning \(K\) is enormous (under instance discrimination, ImageNet's 1.28 million images correspond to 1.28 million classes); a softmax over that many classes is intractable, and the exponential operation over such a high-dimensional output is computationally very expensive.
\(\quad\) Contrastive Learning can be thought of as training an Encoder for a dictionary look-up task. Consider an encoded query \(q\) and a set of encoded samples {\(k_0, k_1, k_2, ...\)} that are the keys of a dictionary. Assume that there is a single key (denoted as \(k_+\)) in the dictionary that \(q\) matches.
\(\quad\)A contrastive loss is a function whose value is low when \(q\) is similar to its positive key \(k_+\) and dissimilar to all other keys (considered negative keys for \(q\)). With similarity measured by dot product, one form of contrastive loss function, called InfoNCE, is:
\(\mathcal{L}_q = -\log \frac{\exp(q \cdot k_+ / \tau)}{\sum_{i=0}^{K} \exp(q \cdot k_i / \tau)}\)
- \(q\) is \(feature_{anchor}\)
- \(k_i\ (i=0,\dots,K)\) are 1 \(feature_{positive}\) plus \(K\) \(feature_{negative}\)
- \(K\) is the number of negative samples after negative sampling
- \(\tau\) means temperature, which is a hyper-parameter
- The sum is over one positive and \(K\) negative samples. Intuitively, this loss is the log loss of a (\(K+1\))-way softmax-based classifier that tries to classify \(q\) as \(k_+\).
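Under these definitions, InfoNCE can be sketched in a few lines of plain Python (dot-product similarity over one positive at index 0 and \(K\) negatives; function and variable names are illustrative):

```python
import math

def info_nce(q, keys, pos_idx=0, tau=0.07):
    # logits l_i = (q . k_i) / tau over all (K+1) keys
    logits = [sum(qa * ka for qa, ka in zip(q, k)) / tau for k in keys]
    mx = max(logits)  # subtract max for numerical stability
    denom = sum(math.exp(l - mx) for l in logits)
    # log loss of a (K+1)-way softmax classifier that picks k_+
    return -math.log(math.exp(logits[pos_idx] - mx) / denom)

# q aligned with k_0 (the positive key) and orthogonal to the negative key
loss = info_nce([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], tau=1.0)
```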
In general, the query representation is \(q = f_q(x^{query})\) where \(f_q\) is an encoder network and \(x^{query}\) is a query sample (likewise, \(k = f_k(x^{key})\)). Their instantiations depend on the specific pretext task. The inputs \(x^{query}\) and \(x^{key}\) can be images, patches, or context consisting of a set of patches. The networks \(f_q\) and \(f_k\) can be identical, partially shared, or different.
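In MoCo the key encoder \(f_k\) is not updated by back-propagation; its parameters are a momentum (EMA) average of the query encoder's: \(\theta_k \leftarrow m \cdot \theta_k + (1-m) \cdot \theta_q\). A parameter-list sketch of that update (real implementations apply this element-wise to framework tensors):

```python
def momentum_update(theta_q, theta_k, m=0.999):
    # theta_k <- m * theta_k + (1 - m) * theta_q, applied parameter-wise;
    # only theta_q receives gradients, theta_k drifts slowly behind it.
    return [m * tk + (1 - m) * tq for tq, tk in zip(theta_q, theta_k)]

theta_k = momentum_update([1.0, 2.0], [0.0, 0.0], m=0.9)
```

With \(m\) close to 1 the key encoder evolves smoothly, which keeps the dictionary of keys consistent across iterations.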