Contrastive Learning with Complex Heterogeneity

With the advent of big data across multiple high-impact applications, we often face the challenge of complex heterogeneity. Newly collected data usually consist of multiple modalities and are characterized by multiple labels, thus exhibiting the co-existence of multiple types of heterogeneity. Although state-of-the-art techniques are good at modeling complex heterogeneity when sufficient label information is available, such label information can be quite expensive to obtain in real applications. Recently, researchers have paid great attention to contrastive learning because of its strong performance when exploiting rich unlabeled data. However, existing work on contrastive learning is not able to address the problem of false negative pairs, i.e., some "negative" pairs may have similar representations if they have the same label. To overcome this issue, in this paper we propose a unified heterogeneous learning framework, which combines a weighted unsupervised contrastive loss and a weighted supervised contrastive loss to model multiple types of heterogeneity. We first provide a theoretical analysis showing that the vanilla contrastive learning loss easily leads to a sub-optimal solution in the presence of false negative pairs, whereas the proposed weighted loss automatically adjusts the weights based on the similarity of the learned representations to mitigate this issue. Experimental results on real-world data sets demonstrate the effectiveness and efficiency of the proposed framework in modeling multiple types of heterogeneity.
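To make the false-negative issue concrete, the following is a minimal sketch of a contrastive (InfoNCE-style) loss for a single anchor, with an optional per-negative weight. The abstract does not state the paper's actual weighting formula, so the function `similarity_weights` and its rule (weight = 1 minus the clipped cosine similarity) are purely hypothetical, chosen only to illustrate the idea of adjusting weights based on the similarity of learned representations. The temperature value and the toy vectors are likewise assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(anchor, positive, negatives, temperature=0.5, weights=None):
    """InfoNCE-style loss for one anchor.

    `weights` scales each negative term; None means the vanilla
    (unweighted) loss, where every negative counts equally.
    """
    if weights is None:
        weights = [1.0] * len(negatives)
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(
        w * math.exp(cosine(anchor, n) / temperature)
        for w, n in zip(weights, negatives)
    )
    return -math.log(pos / (pos + neg))


def similarity_weights(anchor, negatives):
    """Hypothetical weighting rule (NOT taken from the paper).

    Negatives that are already close to the anchor are likely false
    negatives, so they receive a smaller weight in [0, 1].
    """
    return [1.0 - max(cosine(anchor, n), 0.0) for n in negatives]


# Toy example: the first "negative" is nearly identical to the anchor
# (a likely false negative); the second points the opposite way.
anchor = [1.0, 0.0]
positive = [0.9, 0.1]
negatives = [[0.95, 0.05], [-1.0, 0.2]]

vanilla = contrastive_loss(anchor, positive, negatives)
weights = similarity_weights(anchor, negatives)
weighted = contrastive_loss(anchor, positive, negatives, weights=weights)
```

In this toy setting the vanilla loss penalizes the anchor for being close to the first negative, even though that closeness may be correct when both share a label. The weighted variant shrinks that term, so the loss is dominated by the dissimilar negative instead.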
