As a subset of unsupervised representation learning, self-supervised representation learning adopts self-defined signals as supervision and uses the learned representations for downstream tasks such as object detection and image captioning. Many proposed approaches to self-supervised learning naturally follow a multi-view perspective, where the input (e.g., original images) and the self-supervised signals (e.g., augmented images) can be seen as two redundant views of the data. Building on this multi-view perspective, this paper provides an information-theoretic framework to better understand the properties that encourage successful self-supervised learning. Specifically, we demonstrate that self-supervised learned representations can extract task-relevant information and discard task-irrelevant information. Our theoretical framework opens up a larger space of self-supervised learning objective designs. In particular, we propose a composite objective that bridges the gap between prior contrastive and predictive learning objectives, and we introduce an additional objective term to discard task-irrelevant information. To verify our analysis, we conduct controlled experiments to evaluate the impact of the composite objectives. We also explore our framework's empirical generalization beyond the multi-view perspective, where cross-view redundancy may not be clearly observed.
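To make the idea of a composite objective concrete, the following is a minimal numpy sketch, not the paper's actual formulation: it combines a contrastive InfoNCE term (which encourages extracting information shared across views) with a simple predictive reconstruction term, under the illustrative assumption that the two terms are summed with a hypothetical weight `lam`.

```python
import numpy as np

def info_nce(z1, z2, temperature=0.1):
    """Contrastive (InfoNCE-style) loss: matched rows of z1 and z2 are
    treated as positive view pairs, mismatched rows as negatives."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature              # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))            # positives on the diagonal

def predictive_loss(z1, z2):
    """Predictive (reconstruction-style) loss: mean squared error when
    one view's representation is used to predict the other's."""
    return np.mean((z1 - z2) ** 2)

def composite_objective(z1, z2, lam=0.5):
    """Hypothetical composite objective: a weighted sum of the contrastive
    and predictive terms (the weighting `lam` is illustrative only)."""
    return info_nce(z1, z2) + lam * predictive_loss(z1, z2)
```

In this sketch, lowering `info_nce` corresponds to retaining information shared by the two views, while the predictive term penalizes representation mismatch directly; a discard-style term for task-irrelevant information would be an additional regularizer on top of these.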