With the advent of big data across multiple high-impact applications, we often face the challenge of complex heterogeneity. Newly collected data usually consist of multiple modalities and are characterized by multiple labels, thus exhibiting the co-existence of multiple types of heterogeneity. Although state-of-the-art techniques model such complex heterogeneity well when sufficient label information is available, labels can be expensive to obtain in real applications, which leads to sub-optimal performance for these techniques. Inspired by the capability of contrastive learning to exploit rich unlabeled data, in this paper we propose a unified heterogeneous learning framework that combines a weighted unsupervised contrastive loss and a weighted supervised contrastive loss to model multiple types of heterogeneity. We also provide theoretical analyses showing that the proposed weighted supervised contrastive loss is a lower bound of the mutual information between two samples from the same class, and that the weighted unsupervised contrastive loss is a lower bound of the mutual information between the hidden representations of two views of the same sample. Experimental results on real-world data sets demonstrate the effectiveness and efficiency of the proposed method in modeling multiple types of heterogeneity.
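To make the idea of combining a weighted unsupervised and a weighted supervised contrastive loss concrete, the following is a minimal NumPy sketch. It is not the formulation from the paper: the abstract does not specify how the weights are defined, so the per-sample `weights`, the temperature `tau`, the mixing coefficients `alpha` and `beta`, and all function names are hypothetical. The unsupervised term is a generic InfoNCE-style loss between two views, and the supervised term is a generic supervised contrastive loss in which samples sharing a label are treated as positives.

```python
import numpy as np


def _normalize(z):
    """Scale each row to unit L2 norm so dot products are cosine similarities."""
    return z / np.linalg.norm(z, axis=1, keepdims=True)


def _log_softmax_rows(logits):
    """Numerically stable row-wise log-softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def weighted_unsup_contrastive(z1, z2, weights, tau=0.5):
    """InfoNCE-style loss: row i of z1 and row i of z2 form the positive pair.

    `weights` are hypothetical per-sample weights (an assumption of this sketch).
    """
    weights = np.asarray(weights, dtype=float)
    logits = _normalize(z1) @ _normalize(z2).T / tau
    per_sample = -np.diag(_log_softmax_rows(logits))
    return float(np.sum(weights * per_sample) / np.sum(weights))


def weighted_sup_contrastive(z, labels, weights, tau=0.5):
    """Supervised contrastive loss: positives are other samples with the same label."""
    weights = np.asarray(weights, dtype=float)
    labels = np.asarray(labels)
    n = z.shape[0]
    zn = _normalize(z)
    logits = zn @ zn.T / tau
    np.fill_diagonal(logits, -np.inf)  # a sample is never its own positive or negative
    log_prob = _log_softmax_rows(logits)

    pos = (labels[:, None] == labels[None, :]) & ~np.eye(n, dtype=bool)
    n_pos = pos.sum(axis=1)
    has_pos = n_pos > 0  # anchors without any positive are skipped
    if not has_pos.any():
        return 0.0
    pos_log_prob = np.where(pos, log_prob, 0.0).sum(axis=1)
    per_anchor = -pos_log_prob / np.maximum(n_pos, 1)
    return float(np.sum(weights[has_pos] * per_anchor[has_pos]) / np.sum(weights[has_pos]))


def combined_loss(z1, z2, labels, weights, alpha=1.0, beta=1.0, tau=0.5):
    """Hypothetical combination: alpha * unsupervised term + beta * supervised term."""
    unsup = weighted_unsup_contrastive(z1, z2, weights, tau)
    sup = weighted_sup_contrastive(z1, labels, weights, tau)
    return alpha * unsup + beta * sup
```

In this sketch, `z1` and `z2` stand for the hidden representations of two views of the same batch, so the unsupervised term needs no labels, while the supervised term uses whatever labels are available. Both terms decrease as positives move closer together than negatives in the embedding space.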