Among the ubiquitous multimodal data in the real world, text is a modality generated by humans, while images faithfully reflect the physical world. In visual understanding applications, machines are expected to understand images as humans do. Inspired by this, we propose a novel self-supervised learning method, named Text-enhanced Visual Deep InfoMax (TVDIM), which learns better visual representations by fully exploiting naturally occurring multimodal data. The core idea of our self-supervised learning is to maximize, to a rational degree, the mutual information between features extracted from multiple views of a shared context. Unlike previous methods, which consider multiple views from only a single modality, our method produces multiple views from different modalities and jointly optimizes the mutual information of both intra-modality and inter-modality feature pairs. To account for the information gap between inter-modality feature pairs caused by data noise, we adopt \emph{ranking-based} contrastive learning to optimize the mutual information. During evaluation, we directly use the pre-trained visual representations on various image classification tasks. Experimental results show that TVDIM significantly outperforms previous visual self-supervised methods when processing the same set of images.
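As a rough illustration of the objective the abstract describes, the sketch below combines a standard InfoNCE-style contrastive loss over an intra-modality pair (two views of the same images) with the same loss over an inter-modality pair (image and text features). The abstract does not specify TVDIM's actual ranking-based loss, temperature, or weighting, so all values and names here (`info_nce`, the 0.5 weight, the noise levels) are hypothetical placeholders for the generic form, not the paper's method.

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.1):
    """Generic InfoNCE contrastive loss: row i of `anchors` is paired
    with row i of `positives`; all other rows act as negatives."""
    # Normalize so dot products are cosine similarities.
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature  # (N, N) similarity matrix
    # Log-softmax over each row; diagonal entries are the positive pairs.
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

# Hypothetical features: two correlated image views (intra-modality)
# and a noisier paired text embedding (inter-modality).
rng = np.random.default_rng(0)
img_view1 = rng.normal(size=(8, 64))
img_view2 = img_view1 + 0.05 * rng.normal(size=(8, 64))
txt_embed = img_view1 + 0.20 * rng.normal(size=(8, 64))

# Jointly optimize intra- and inter-modality terms; the 0.5 weight is
# an arbitrary placeholder for whatever balance the method would use.
loss = info_nce(img_view1, img_view2) + 0.5 * info_nce(img_view1, txt_embed)
print(float(loss))
```

The noisier text pairing is exactly the "information gap" the abstract mentions: the inter-modality positive is a weaker signal than the intra-modality one, which motivates replacing the plain softmax term with a ranking-based objective.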