Momentum Contrast (MoCo) for Unsupervised Visual Representation Learning

1 Introduction

1.1 Instance Discrimination

Instance discrimination defines the rule for dividing samples into positives and negatives.

Given a dataset of N images, randomly pick one image $x_1$. Applying two different data transformations to it yields a positive pair, while the remaining images $x_2, x_3, \dots, x_N$ serve as negative samples. Every sample is passed through an Encoder (also called a Feature Extractor) to obtain its feature (a minimal sketch follows the list below).

  • $x_i$: the $i$-th image
  • $T_j$: the $j$-th data transformation (data augmentation)
  • $x_1^{(1)}$: the anchor (reference point)
  • $x_1^{(2)}$: the positive sample for the anchor
  • $x_2, x_3, \dots, x_N$: negative samples for the anchor
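
To make this concrete, here is a minimal sketch (PyTorch/torchvision) of how two random augmentations of one image form a positive pair. The file name and the exact augmentation recipe are illustrative assumptions, not the paper's exact pipeline:

```python
import torch
from torchvision import transforms
from PIL import Image

# Illustrative augmentation pipeline; MoCo v1 uses a similar recipe
# (random resized crop, color jitter, grayscale, horizontal flip).
augment = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.ColorJitter(0.4, 0.4, 0.4),
    transforms.RandomGrayscale(p=0.2),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

img = Image.open("example.jpg")  # x_1: a randomly chosen image (path is hypothetical)
x1_view1 = augment(img)          # x_1^(1): the anchor
x1_view2 = augment(img)          # x_1^(2): the positive sample
# Any augmented view of a different image x_2, ..., x_N is a negative.
```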

1.2 Momentum

Mathematically, momentum can be understood as an exponential moving average (EMA).

Here $m$ is the momentum coefficient; it keeps the output $Y_t$ from depending entirely on the current input $X_t$:

$$Y_t = m \cdot Y_{t-1} + (1 - m) \cdot X_t$$
  • $Y_t$: the output at the current step
  • $Y_{t-1}$: the output at the previous step
  • $X_t$: the input at the current step

The larger $m$ is, the less $Y_t$ depends on the current input $X_t$.
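
This EMA is exactly how MoCo updates its key encoder: $\theta_k \leftarrow m\,\theta_k + (1-m)\,\theta_q$, with $m = 0.999$ in the paper. A minimal sketch (function and encoder names are illustrative):

```python
import torch

@torch.no_grad()
def momentum_update(encoder_q, encoder_k, m=0.999):
    # theta_k <- m * theta_k + (1 - m) * theta_q
    for param_q, param_k in zip(encoder_q.parameters(), encoder_k.parameters()):
        param_k.data = m * param_k.data + (1.0 - m) * param_q.data
```

A large $m$ makes the key encoder evolve slowly, which the paper argues is what keeps the keys in the dictionary consistent with each other.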

1.3 Momentum Contrast (MoCo)

MoCo combines the two ideas above: it views contrastive learning as dictionary look-up, maintains the dictionary as a queue of encoded keys, and updates the key encoder as a momentum (EMA) copy of the query encoder, keeping the dictionary both large and consistent.

2 Related Work

The related work concerns the relationship between supervised learning and unsupervised/self-supervised learning, which can be discussed from two aspects: loss functions and pretext tasks.

2.1 Loss Functions

2.2 Pretext Tasks

3 Method

3.1 InfoNCE Loss

Cross-Entropy Loss

Using the softmax formula, the probability that the model assigns to the true class is:

$$\hat{y}^+ = \mathrm{softmax}(z^+) = \frac{\exp(z^+)}{\sum_{i=1}^{K}\exp(z_i)}$$

In supervised learning, the ground truth is a one-hot vector (for instance $[0, 1, 0, 0]$, where $K = 4$), so the cross-entropy loss reduces to:

$$\mathcal{L}_{CE} = -\sum_{i=1}^{K} y_i \log(\hat{y}_i) = -\log\hat{y}^+ = -\log\mathrm{softmax}(z^+) = -\log\frac{\exp(z^+)}{\sum_{i=1}^{K}\exp(z_i)}$$
  • $K$ is num_labels (the number of classes)
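
As a quick sanity check of this identity, the following sketch (illustrative logits, $K = 4$) shows that PyTorch's built-in cross-entropy matches $-\log \mathrm{softmax}(z^+)$:

```python
import torch
import torch.nn.functional as F

z = torch.tensor([[1.0, 3.0, 0.5, -1.0]])  # logits for K = 4 classes
target = torch.tensor([1])                  # true class: one-hot [0, 1, 0, 0]

builtin = F.cross_entropy(z, target)            # PyTorch's cross-entropy
manual = -torch.log(F.softmax(z, dim=1)[0, 1])  # -log softmax(z^+)
print(builtin.item(), manual.item())            # the two values match
```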

InfoNCE Loss

So why can't cross-entropy loss be used directly as the loss function for contrastive learning? Because in contrastive learning $K$ is enormous: under instance discrimination, ImageNet's 1.28 million images become 1.28 million classes. A softmax over that many classes is impractical, and computing the exponential over such a high-dimensional output is prohibitively expensive.

Contrastive Learning can be thought of as training an Encoder for a dictionary look-up task. Consider an encoded query q and a set of encoded samples {k0,k1,k2,...} that are the keys of a dictionary. Assume that there is a single key (denoted as k+) in the dictionary that q matches.
A contrastive loss is a function whose value is low when q is similar to its positive key k+ and dissimilar to all other keys (considered negative keys for q). With similarity measured by dot product, one form of contrastive loss function, called InfoNCE, is:

$$\mathcal{L}_q = -\log\frac{\exp(q \cdot k_+ / \tau)}{\sum_{i=0}^{K}\exp(q \cdot k_i / \tau)}$$
  • $q$: the query, i.e., the feature of the anchor
  • $k_i\ (i = 0, \dots, K)$: the keys, i.e., 1 positive feature plus $K$ negative features
  • $K$: the number of negatives after negative sampling
  • $\tau$: the temperature, a hyper-parameter
  • The sum is over one positive and K negative samples. Intuitively, this loss is the log loss of a (K+1)-way softmax-based classifier that tries to classify q as k+.
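
This means InfoNCE can be computed with ordinary cross-entropy machinery. The sketch below follows the logic of the pseudocode in the MoCo paper: stack one positive logit and $K$ negative logits per query, and treat the positive as class 0 of a $(K+1)$-way classifier (tensor names and shapes are illustrative):

```python
import torch
import torch.nn.functional as F

def info_nce(q, k_pos, queue, tau=0.07):
    # q: (N, C) query features; k_pos: (N, C) positive key features
    # queue: (C, K) negative key features; tau: temperature
    l_pos = torch.einsum("nc,nc->n", q, k_pos).unsqueeze(-1)  # (N, 1)
    l_neg = torch.einsum("nc,ck->nk", q, queue)               # (N, K)
    logits = torch.cat([l_pos, l_neg], dim=1) / tau           # (N, K+1)
    labels = torch.zeros(q.shape[0], dtype=torch.long)        # positive is class 0
    return F.cross_entropy(logits, labels)
```

With the positive logit fixed at index 0, `F.cross_entropy` computes exactly $-\log\big(\exp(q \cdot k_+/\tau) / \sum_{i}\exp(q \cdot k_i/\tau)\big)$.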

In general, the query representation is $q = f_q(x^{query})$ where $f_q$ is an encoder network and $x^{query}$ is a query sample (likewise, $k = f_k(x^{key})$). Their instantiations depend on the specific pretext task. The inputs $x^{query}$ and $x^{key}$ can be images, patches, or context consisting of a set of patches. The networks $f_q$ and $f_k$ can be identical, partially shared, or different.
