A Casual Look at Contrastive Learning, Part 2: MINE + SimCLR + SimCLRV2

July 17, 2022 · PaperWeekly

© Author | Wu Tong (吴桐)

Research area | Recommender systems

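This post follows up on the previous article:

A Casual Look at Contrastive Learning, Part 1

This time I mainly want to write up three classic contrastive learning papers I read recently:

SimCLR-A Simple Framework for Contrastive Learning of Visual Representations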

https://arxiv.org/abs/2002.05709

SimCLRV2-Big Self-Supervised Models are Strong Semi-Supervised Learners

https://arxiv.org/abs/2006.10029

Mutual Information Neural Estimation

https://arxiv.org/abs/1801.04062

An aside: after reading SimCLR and SimCLRV2, my feeling about the "alchemy" of model tuning in CV is that the road ahead is long and winding.

SimCLR

Notes from the paper reading I gave in my team:

https://bytedance.feishu.cn/docx/doxcn0hAWqSip1niZE2ZCpyO4yb

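While I'm at it, a quick plug for my company's Feishu Docs: it's the GOAT!

A new term I picked up recently: "stitched-together monster" (缝合怪)~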

1.1 One Sentence Summary

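Proposes a contrastive learning framework. By today's standards the framework itself is not especially novel, but the paper runs very thorough experiments to pin down which factors contrastive learning actually benefits from. To me it reads more like an experimental-summary paper: it tells you some properties of contrastive learning and offers tuning directions for practice.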

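1.2 Algorithm

The loss is simply a softmax loss with a temperature (the paper calls it NT-Xent).
One point to note: the authors add a projection head g on top of the learned encoder f, then compute the loss and backpropagate gradients from g's output; at inference time only f is used. The authors' hypothesis and experiments on this are genuinely interesting and worth borrowing. (In SimCLRV2 this conclusion changes quite a bit, haha; more on that later.)
Framework diagram: f and g are trained jointly, but once training finishes g is thrown away and only f is kept. I find this the most interesting point of the paper, since the usual approach would be to compute the contrastive loss directly on f's output~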
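As a rough illustration of the loss described above, here is a minimal pure-Python sketch of a temperature-scaled softmax contrastive loss. The function names, the pairing convention (rows 2k and 2k+1 are two augmented views of the same sample), and the default temperature are assumptions made for this example, not the authors' implementation. The inputs play the role of z = g(f(x)); after training, only f would be kept.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so that dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def nt_xent_loss(z, temperature=0.5):
    """Temperature-scaled softmax contrastive loss over 2N projected vectors.

    Assumed layout (hypothetical, for illustration): z[2k] and z[2k + 1]
    are the projections of two augmented views of the same sample.
    """
    z = [l2_normalize(v) for v in z]
    n = len(z)
    total = 0.0
    for i in range(n):
        j = i ^ 1  # index of the positive partner of i
        # Similarities to every other vector; the vector itself is excluded.
        logits = [dot(z[i], z[k]) / temperature for k in range(n) if k != i]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        positive = dot(z[i], z[j]) / temperature
        total += log_denominator - positive
    return total / n


# Two samples, two views each; views of the same sample point the same way.
views = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(nt_xent_loss(views))
```

Lowering the temperature sharpens the softmax, so hard negatives contribute more to the loss; the value 0.5 here is only a placeholder.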

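1.3 Experiments

There are too many experiment figures to paste here; if you want the details, see my paper-reading notes:

https://bytedance.feishu.cn/docx/doxcn0hAWqSip1niZE2ZCpyO4yb

I will only comment on the points I personally found interesting.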

▲ Furthermore, even when nonlinear projection is used, the layer before the projection head, h, is still much better (>10%) than the layer after, z = g(h), which shows that the hidden layer before the projection head is a better representation than the layer after

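Experiments show the importance of composing different data augmentations
Experiments show that contrastive learning needs stronger data augmentation than supervised learning
Experiments show that unsupervised contrastive learning benefits from larger models
Experiments show how different losses affect performance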

SimCLRV2

Notes from the paper reading I gave in my team:

https://bytedance.feishu.cn/docx/doxcn9P6oMZzuwZOrYAhYc5AUUe

2.1 One Sentence Summary

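Proposes a semi-supervised learning framework built on contrastive learning: a distillation module is attached after SimCLR to do semi-supervised learning, which improves results substantially. (An interesting point in this paper: with self-distillation the student actually ends up better than the original model. A colleague raised this question during our team paper reading; after looking into it, the main explanation is that self-distillation lets the student model learn additional views.)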

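2.2 Algorithm

The method has three stages: pre-training, fine-tuning, and distillation (self-distillation):

Pre-training: uses SimCLR with a deeper projection head. Also, rather than discarding the projection head entirely, the 1st layer of the projection head is kept for the later fine-tuning... (that is quite a trick)
Fine-tuning: take the network trained with SimCLR, attach a downstream MLP, and fine-tune on the classification task.
Distillation: use the trained teacher network to label the unlabeled data, treat those labels as ground truth, and train the student network on them. The authors also tried adding the labeled (sample + label) pairs here, found little difference, and so left the labeled pairs out.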
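The distillation stage can be sketched as a cross-entropy between the teacher's temperature-scaled output distribution and the student's, computed on unlabeled inputs. This is a minimal sketch under assumptions: the function names, the use of plain lists of logits, and the temperature value are illustrative and not taken from the paper's code.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, computed in a numerically stable way."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=1.0):
    """Mean cross-entropy of student predictions against teacher soft labels.

    Each argument is a list of per-example logit lists. No ground-truth
    labels are used: the teacher's distribution is the training target.
    """
    loss = 0.0
    for t_row, s_row in zip(teacher_logits, student_logits):
        p_teacher = softmax(t_row, temperature)
        p_student = softmax(s_row, temperature)
        loss -= sum(pt * math.log(ps) for pt, ps in zip(p_teacher, p_student))
    return loss / len(teacher_logits)


# Hypothetical logits for two unlabeled examples and three classes.
teacher = [[2.0, 0.0, -1.0], [0.5, 1.5, 0.0]]
student = [[1.0, 0.2, -0.5], [0.0, 1.0, 0.3]]
print(distillation_loss(teacher, student, temperature=1.0))
```

For a fixed teacher, the loss is smallest when the student reproduces the teacher's distribution exactly; in self-distillation the student simply has the same architecture as the teacher.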

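2.3 Experimental Results

I won't paste all the experiment figures; if you want the details, see my paper-reading notes:

https://bytedance.feishu.cn/docx/doxcn9P6oMZzuwZOrYAhYc5AUUe

I will only comment on the points I personally found interesting.

Let's look at the figure on the first page of the paper (as Mu Li puts it, a figure placed on the first page must be a really impressive one!). It does show that with only 1% of the data + labels the model reaches results comparable to supervised learning with 100% of the data + labels, and with 10% of the data + labels it already surpasses the SOTA, which is indeed pretty impressive.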

● Distillation Using Unlabeled Data Improves Semi-Supervised Learning

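The interesting thing about this experiment is that both self-distillation and distilling into a smaller model give results better than the teacher model. For an explanation, see references [1] and [5]; it is a fun read~ The main point: distillation lets the student learn more views, which improves performance~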

● Bigger Models Are More Label-Efficient

● Bigger/Deeper Projection Heads Improve Representation Learning

MINE

Notes from the paper reading I gave in my team:

https://bytedance.feishu.cn/docx/doxcnMHzZBeWFV4HZV6NAU7W73o

3.1 One Sentence Summary

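Adds mutual information to the loss function and argues theoretically that MINE is more flexible/scalable; it alleviates the mode-dropping problem in GANs and improves the reconstruction and inference quality of ALI. (This paper contains a lot of theoretical derivation; if you are interested, I recommend reading the original paper or the linked write-up.)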

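3.2 Algorithm

Given a batch of b paired samples (x, z) drawn from the training set, batch_size = b.
Then randomly sample b values of z from the training set to serve as negative samples.
Compute the loss and backpropagate the gradients.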
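The steps above can be sketched as follows. The quantity MINE maximizes is the Donsker-Varadhan lower bound on mutual information, E_joint[T(x, z)] - log E_marginal[exp(T(x, z'))]. In the paper the statistic T is a neural network trained by gradient ascent; here `statistic` is a fixed hand-written function and no gradient step is performed, both of which are simplifying assumptions for illustration. The function name and the use of shuffling to draw the negative z values are likewise assumptions.

```python
import math
import random


def dv_lower_bound(pairs, statistic, rng):
    """Donsker-Varadhan estimate of a mutual-information lower bound on one batch.

    pairs:     list of (x, z) samples drawn from the joint distribution
    statistic: function T(x, z) -> float (a neural network in MINE itself)
    rng:       random.Random used to shuffle z and break the pairing
    """
    xs = [x for x, _ in pairs]
    zs = [z for _, z in pairs]
    # Negative samples: the same z values re-paired at random with the xs,
    # which approximates sampling from the product of the marginals.
    shuffled = zs[:]
    rng.shuffle(shuffled)

    joint_term = sum(statistic(x, z) for x, z in pairs) / len(pairs)

    marginal_scores = [statistic(x, z) for x, z in zip(xs, shuffled)]
    # Numerically stable log of the mean of exp(scores).
    m = max(marginal_scores)
    log_mean_exp = m + math.log(
        sum(math.exp(s - m) for s in marginal_scores) / len(marginal_scores)
    )
    return joint_term - log_mean_exp


# Toy data: z is a noisy copy of x, so x and z are strongly dependent.
rng = random.Random(0)
batch = []
for _ in range(512):
    x = rng.gauss(0.0, 1.0)
    batch.append((x, x + 0.1 * rng.gauss(0.0, 1.0)))

# A hand-picked statistic; a trained network would give a tighter bound.
print(dv_lower_bound(batch, lambda x, z: 0.25 * x * z, rng))
```

Training would repeat this computation on fresh batches and update the parameters of T to push the bound up, so the negative of this value plays the role of the loss.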

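3.3 Experiments

I won't paste all the experiment figures; if you want the details, see my paper-reading notes:

https://bytedance.feishu.cn/docx/doxcnMHzZBeWFV4HZV6NAU7W73o

I will only comment on the points I personally found interesting.

Ability to capture nonlinear dependencies:

Next, let's look at how mutual information is applied to GAN and Bi-GAN (I won't paste the experimental results; unsurprisingly they are better. See the paper or my paper-reading notes for details)

Application to GAN:

Application to Bi-GAN: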

References

[1] Jishi Platform (极市平台): The Three Mysteries of Deep Learning: Ensembles, Knowledge Distillation, and Self-Distillation

[2] Chen T, Kornblith S, Norouzi M, et al. A simple framework for contrastive learning of visual representations[C]//International conference on machine learning. PMLR, 2020: 1597-1607.

[3] Chen T, Kornblith S, Swersky K, et al. Big self-supervised models are strong semi-supervised learners[J]. Advances in neural information processing systems, 2020, 33: 22243-22255.

[4] Belghazi M I, Baratin A, Rajeshwar S, et al. Mutual information neural estimation[C]//International conference on machine learning. PMLR, 2018: 531-540.

[5] Allen-Zhu Z, Li Y. Towards understanding ensemble, knowledge distillation and self-distillation in deep learning[J]. arXiv preprint arXiv:2012.09816, 2020.
