Contrastive self-supervised learning is a promising class of techniques that learns representations by maximizing agreement between differently augmented views of the same data sample through a contrastive loss in latent space; the learned encodings capture what makes two things similar or different.

VIP Content

Background: In real-world applications, the complexity of data acquisition and transmission means that some views of the data may be lost, leading to the Incomplete Multi-view Problem (IMP), i.e., learning under incomplete information. For example, in an online meeting, some video frames may lose their visual or audio signal because of sensor failure. Over the past decade, a number of Incomplete Multi-view Clustering (IMC) methods have been proposed for this problem and have achieved notable results. However, IMP still faces two main challenges: 1) how to learn a consistent common representation across views without using label information; and 2) how to recover the complete data from partially missing data.

http://pengxi.me/wp-content/uploads/2021/03/2021CVPR-completer.pdf

Innovation: To address the above challenges, and inspired by the recent work of Tsai et al. published at ICLR 2021, this paper offers a new insight into incomplete multi-view clustering: data recovery and consistency learning are two sides of the same coin and can be unified within an information-theoretic framework. This observation and the accompanying theoretical results differ substantially from existing work, which treats consistency learning and data recovery as two independent problems. Briefly, from an information-theoretic perspective, mutual information quantifies the consistency between cross-view representations, while conditional entropy quantifies cross-view recoverability. On the one hand, maximizing mutual information and minimizing conditional entropy respectively increase the amount of shared information and the recoverability of the data; on the other hand, the two objectives complement and reinforce each other. The difference from Tsai et al.'s work is that they mainly use predictive learning within an information-theoretic framework to improve contrastive learning, without considering, as this paper does, the joint learning of consistency and recoverability under missing views.
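In symbols (the notation Z^1, Z^2 for the two views' representations is assumed here for illustration and is not taken from the paper), the unified objective sketched above is

\[
\max \; I(Z^{1}; Z^{2}) \qquad \text{and} \qquad \min \; H(Z^{1}\mid Z^{2}) + H(Z^{2}\mid Z^{1}),
\]

where maximizing the cross-view mutual information encourages consistency and minimizing the cross-view conditional entropies encourages recoverability.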

Method: Based on the above observations, the paper proposes a dual-prediction paradigm and combines it with contrastive learning, jointly optimizing cross-view consistency and recoverability through a new loss function. The proposed loss has three parts: 1) a within-view reconstruction loss, used mainly to learn the view-specific representation of each view and composed of a set of independent autoencoder reconstruction losses; 2) a cross-view contrastive learning loss, which learns multi-view consistency by maximizing the mutual information between views; and 3) a cross-view dual-prediction loss, which recovers view data by minimizing the conditional entropy of the view representations.
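As a rough illustration of how the three terms could be combined, here is a minimal PyTorch-style sketch. All names (enc1, dec1, pred12, the InfoNCE-style contrastive surrogate, the MSE prediction term, the temperature, and the loss weights) are assumptions for illustration, not the paper's implementation:

```python
import torch
import torch.nn.functional as F

def multiview_loss(x1, x2, enc1, dec1, enc2, dec2, pred12, pred21,
                   temperature=0.5, w_con=0.1, w_pre=0.1):
    """Sketch of a three-part objective: within-view reconstruction,
    cross-view contrastive learning, and cross-view dual prediction."""
    z1, z2 = enc1(x1), enc2(x2)                      # view-specific representations

    # 1) Within-view reconstruction: one autoencoder loss per view.
    loss_rec = F.mse_loss(dec1(z1), x1) + F.mse_loss(dec2(z2), x2)

    # 2) Cross-view contrastive loss: an InfoNCE-style surrogate for
    #    maximizing the mutual information between the two views.
    p1, p2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = p1 @ p2.t() / temperature
    targets = torch.arange(x1.size(0), device=x1.device)
    loss_con = F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)

    # 3) Cross-view dual prediction: predict each view's representation from
    #    the other, a surrogate for minimizing the conditional entropy.
    loss_pre = F.mse_loss(pred12(z1), z2.detach()) + F.mse_loss(pred21(z2), z1.detach())

    return loss_rec + w_con * loss_con + w_pre * loss_pre
```

Here the contrastive term plays the role of the mutual-information objective and the dual-prediction term plays the role of the conditional-entropy objective described above.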

Popular Content

This paper presents SimCLR: a simple framework for contrastive learning of visual representations. We simplify recently proposed contrastive self-supervised learning algorithms without requiring specialized architectures or a memory bank. In order to understand what enables the contrastive prediction tasks to learn useful representations, we systematically study the major components of our framework. We show that (1) composition of data augmentations plays a critical role in defining effective predictive tasks, (2) introducing a learnable nonlinear transformation between the representation and the contrastive loss substantially improves the quality of the learned representations, and (3) contrastive learning benefits from larger batch sizes and more training steps compared to supervised learning. By combining these findings, we are able to considerably outperform previous methods for self-supervised and semi-supervised learning on ImageNet. A linear classifier trained on self-supervised representations learned by SimCLR achieves 76.5% top-1 accuracy, which is a 7% relative improvement over previous state-of-the-art, matching the performance of a supervised ResNet-50. When fine-tuned on only 1% of the labels, we achieve 85.8% top-5 accuracy, outperforming AlexNet with 100X fewer labels.
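To illustrate the core recipe (two augmented views of each image, a learnable nonlinear projection head between the representation and the loss, and a contrastive objective over the batch), here is a minimal NT-Xent-style sketch. The function and argument names are assumptions, and the temperature value is illustrative rather than the paper's setting:

```python
import torch
import torch.nn.functional as F

def nt_xent(h1, h2, proj, temperature=0.5):
    """Contrastive loss over two augmented views of the same batch, with a
    learnable nonlinear projection head applied before the loss."""
    z1 = F.normalize(proj(h1), dim=1)              # projected, L2-normalized view 1
    z2 = F.normalize(proj(h2), dim=1)              # projected, L2-normalized view 2
    z = torch.cat([z1, z2], dim=0)                 # 2N embeddings
    sim = z @ z.t() / temperature                  # pairwise similarities
    sim.fill_diagonal_(float('-inf'))              # exclude self-similarity
    n = h1.size(0)
    # positives: rows i and i+n are the two views of the same image
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(h1.device)
    return F.cross_entropy(sim, targets)
```

In this sketch h1 and h2 would be encoder outputs for two augmentations of the same image batch, and proj a small MLP playing the role of the projection head.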

Latest Content

Contrastive learning has nearly closed the gap between supervised and self-supervised learning of image representations, and has also been explored for videos. However, prior work on contrastive learning for video data has not explored the effect of explicitly encouraging the features to be distinct across the temporal dimension. We develop a new temporal contrastive learning framework consisting of two novel losses to improve upon existing contrastive self-supervised video representation learning methods. The local-local temporal contrastive loss adds the task of discriminating between non-overlapping clips from the same video, whereas the global-local temporal contrastive loss aims to discriminate between timesteps of the feature map of an input clip in order to increase the temporal diversity of the learned features. Our proposed temporal contrastive learning framework achieves significant improvement over the state-of-the-art results in various downstream video understanding tasks such as action recognition, limited-label action classification, and nearest-neighbor video retrieval on multiple video datasets and backbones. We also demonstrate significant improvement in fine-grained action classification for visually similar classes. With the commonly used 3D ResNet-18 architecture, we achieve 82.4% (+5.1% increase over the previous best) top-1 accuracy on UCF101 and 52.9% (+5.4% increase) on HMDB51 action classification, and 56.2% (+11.7% increase) Top-1 Recall on UCF101 nearest neighbor video retrieval.
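To make the local-local idea concrete, here is a minimal sketch consistent with the description above: two augmented views of the same clip act as positives, while non-overlapping clips from the same video (and clips from other videos) act as negatives. The shapes, names, and temperature are assumptions for illustration, not the authors' implementation:

```python
import torch
import torch.nn.functional as F

def local_local_temporal_loss(clip_feats, temperature=0.1):
    """Sketch of a local-local temporal contrastive loss.

    clip_feats: tensor of shape [B, K, 2, D] -- B videos, K non-overlapping
    clips per video, 2 augmented views per clip, D-dimensional features.
    """
    B, K, V, D = clip_feats.shape
    z = F.normalize(clip_feats, dim=-1).reshape(B * K * V, D)
    sim = z @ z.t() / temperature              # similarities between all clip views
    sim.fill_diagonal_(float('-inf'))          # exclude self-similarity
    idx = torch.arange(B * K * V, device=z.device)
    targets = idx ^ 1                          # the other augmented view of the same clip
    return F.cross_entropy(sim, targets)
```

Because every other clip of the same video appears among the negatives, the loss explicitly pushes features from different temporal segments of a video apart, which is the effect the abstract describes.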
