Momentum Contrast (MoCo) for Unsupervised Visual Representation Learning
1 Introduction
1.1 Instance Discrimination
Instance discrimination defines a rule for splitting samples into positives and negatives:
Given a dataset of N images, randomly pick one image x_i. Applying different data transformations to it yields a positive pair; all the other images x_j (j ≠ i) are negative samples. Each sample is passed through an Encoder (also called a Feature Extractor) to obtain its feature.

- x_i: the i-th image
- T: the data transformation (data augmentation)
- x_i^1: the anchor (reference point)
- x_i^2: the positive sample for the anchor (another view of the same image)
- x_j (j ≠ i): the negative samples for the anchor
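The pair construction above can be sketched in a few lines. This is a toy illustration, not the actual MoCo data pipeline: `augment` here (noise plus a random flip) is a hypothetical stand-in for the random crop / color jitter used in practice.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(img, rng):
    """Toy data transformation T: small noise plus a random horizontal flip
    (a stand-in for the crop/color-jitter augmentations used in practice)."""
    out = img + rng.normal(0, 0.01, img.shape)
    if rng.random() < 0.5:
        out = out[:, ::-1]
    return out

# A toy "dataset" of N random images.
N = 8
images = rng.normal(size=(N, 4, 4))

i = 3                                # randomly chosen image x_i
x_anchor = augment(images[i], rng)   # anchor view x_i^1
x_pos = augment(images[i], rng)      # positive view x_i^2: same image, different transform
negatives = [images[j] for j in range(N) if j != i]  # every other image is a negative

print(len(negatives))  # N - 1 = 7
```

The two views come from the same underlying image but different random transformations, which is exactly what makes them a positive pair under instance discrimination.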
1.2 Momentum
Mathematically, momentum can be understood as an exponential moving average (EMA):

y_t = m · y_{t-1} + (1 − m) · x_t

where m is the momentum coefficient; its purpose is to keep y_t from depending entirely on the current input x_t.

- y_t: the output at the current step
- y_{t-1}: the output at the previous step
- x_t: the input at the current step

The larger m is, the less y_t depends on the current input x_t.
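A minimal sketch of the EMA update above (in MoCo this same rule, with m close to 1, is what updates the key encoder's parameters from the query encoder's):

```python
def ema_update(y_prev, x_t, m=0.9):
    """Exponential moving average: y_t = m * y_{t-1} + (1 - m) * x_t."""
    return m * y_prev + (1 - m) * x_t

# With a large m, the output moves only slowly toward the new input.
y = 0.0
for x in [1.0, 1.0, 1.0, 1.0]:
    y = ema_update(y, x, m=0.9)
print(round(y, 4))  # 0.3439: after four steps of constant input 1.0, y is still far from 1.0
```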
1.3 Momentum Contrast (MoCo)

2 Related Work

This section discusses the relationship between supervised learning and unsupervised/self-supervised learning.
2.1 Loss Functions
2.2 Pretext Tasks
3 Method
3.1 InfoNCE Loss
CrossEntropyLoss
In formula form, the probability the model assigns to the true sample (class) is computed with a softmax:

p_i = exp(z_i) / Σ_{j=1}^{K} exp(z_j)

In supervised learning, the ground truth is a one-hot vector (for instance [0, 1, 0, 0], where K = 4), so the cross-entropy loss reduces to:

L_CE = − Σ_{i=1}^{K} y_i · log p_i = − log p_c

- K is num_labels (the number of classes in the dataset); c is the index of the true class
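A minimal NumPy sketch of the softmax and one-hot cross-entropy above:

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(logits, target_idx):
    """With a one-hot target, cross entropy reduces to -log p_target."""
    return -np.log(softmax(logits)[target_idx])

logits = np.array([1.0, 3.0, 0.5, 0.5])       # K = 4 classes
loss = cross_entropy(logits, target_idx=1)    # one-hot target [0, 1, 0, 0]
print(float(loss))
```

The loss only depends on the probability assigned to the true class, which is why the one-hot sum collapses to a single term.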
InfoNCE Loss
So why can't CrossEntropyLoss be used directly as the loss function for contrastive learning? Because in contrastive learning the number of classes K is enormous (for example, ImageNet has 1.28 million images, i.e., 1.28 million classes under instance discrimination); a softmax over that many classes is infeasible, and the exponential operation over such a high dimension is computationally very expensive.
Contrastive learning can be thought of as training an encoder for a dictionary look-up task. Consider an encoded query q and a set of encoded samples {k_0, k_1, k_2, …} that are the keys of a dictionary. Assume that there is a single key (denoted as k_+) in the dictionary that q matches.
A contrastive loss is a function whose value is low when q is similar to its positive key k_+ and dissimilar to all other keys (considered negative keys for q). With similarity measured by dot product, one form of contrastive loss function, called InfoNCE, is:

L_q = − log [ exp(q · k_+ / τ) / Σ_{i=0}^{K} exp(q · k_i / τ) ]
- q · k: the dot-product similarity between the query and a key
- the denominator sums over K + 1 keys: 1 positive plus K negatives
- K: the number of negative samples after negative sampling
- τ: the temperature, a hyper-parameter
- The sum is over one positive and K negative samples. Intuitively, this loss is the log loss of a (K+1)-way softmax-based classifier that tries to classify q as k_+.
In general, the query representation is q = f_q(x^q), where f_q is an encoder network and x^q is a query sample (likewise, k = f_k(x^k)). Their instantiations depend on the specific pretext task. The inputs x^q and x^k can be images, patches, or context consisting of a set of patches. The networks f_q and f_k can be identical, partially shared, or different.
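The InfoNCE formula above can be sketched directly in NumPy. This is a toy illustration with randomly generated, L2-normalized features, not MoCo's actual queue-based implementation; the positive key is built as a slightly perturbed copy of the query so it is more similar to q than the random negatives.

```python
import numpy as np

rng = np.random.default_rng(0)

def info_nce(q, k_pos, k_negs, tau=0.07):
    """InfoNCE: -log( exp(q·k_+/tau) / sum over 1 positive + K negative keys ).
    Equivalent to (K+1)-way cross entropy with the positive at index 0."""
    logits = np.array([q @ k_pos] + [q @ k for k in k_negs]) / tau
    logits -= logits.max()                        # numerical stability
    return -np.log(np.exp(logits[0]) / np.exp(logits).sum())

d, K = 16, 10
q = rng.normal(size=d); q /= np.linalg.norm(q)
k_pos = q + 0.1 * rng.normal(size=d)              # close to q (a "matching" key)
k_pos /= np.linalg.norm(k_pos)
k_negs = [v / np.linalg.norm(v) for v in rng.normal(size=(K, d))]

loss = info_nce(q, k_pos, k_negs)
print(loss)  # small, since the positive key is far more similar to q than the negatives
```

Note how the small temperature τ sharpens the softmax: even moderate similarity gaps between the positive and negative keys translate into a near-zero loss.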