Contrastive learning has proven to be highly efficient and adaptable in shaping representation spaces across diverse modalities by pulling similar samples together and pushing dissimilar ones apart. However, two key limitations persist: (1) without explicit regulation of the embedding distribution, semantically related instances can inadvertently be pushed apart unless complementary signals guide pair selection, and (2) excessive reliance on large numbers of in-batch negatives and tailored augmentations hinders generalization. To address these limitations, we propose Variational Supervised Contrastive Learning (VarCon), which reformulates supervised contrastive learning as variational inference over latent class variables and maximizes a posterior-weighted evidence lower bound (ELBO). This formulation replaces exhaustive pairwise comparisons with efficient class-aware matching and grants fine-grained control over intra-class dispersion in the embedding space. Trained exclusively on image data, our experiments on CIFAR-10, CIFAR-100, ImageNet-100, and ImageNet-1K show that VarCon (1) achieves state-of-the-art performance among contrastive learning frameworks, reaching 79.36% Top-1 accuracy on ImageNet-1K and 78.29% on CIFAR-100 with a ResNet-50 encoder while converging in just 200 epochs; (2) yields substantially clearer decision boundaries and semantic organization in the embedding space, as evidenced by KNN classification, hierarchical clustering, and transfer-learning evaluations; and (3) outperforms the supervised baseline in few-shot learning and shows superior robustness across diverse augmentation strategies. Our code is available at https://github.com/ziwenwang28/VarContrast.
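To make the core idea concrete, the sketch below illustrates, in PyTorch, what a posterior-weighted class-matching objective can look like: each embedding is scored against per-class prototypes rather than against every other sample in the batch, and the per-sample class log-likelihood is weighted by the approximate posterior over latent class variables. The function name, the prototype parameterization, and the exact weighting scheme are all illustrative assumptions for exposition; they are not the paper's actual VarCon loss.

```python
import torch
import torch.nn.functional as F

def posterior_weighted_class_loss(z, labels, prototypes, tau=0.1):
    """Illustrative sketch (not the paper's loss) of a posterior-weighted
    class-matching objective in the spirit of the abstract's description."""
    # Normalize embeddings and class prototypes onto the unit sphere.
    z = F.normalize(z, dim=-1)            # (B, D) sample embeddings
    mu = F.normalize(prototypes, dim=-1)  # (C, D) one prototype per class
    # Class-similarity logits: each sample is compared against C prototypes,
    # replacing exhaustive pairwise comparisons against B-1 in-batch samples.
    logits = z @ mu.t() / tau             # (B, C)
    # Approximate posterior q(y | z) over the latent class variable.
    posterior = F.softmax(logits, dim=-1)
    # Per-sample negative log-likelihood of the ground-truth class ...
    nll = F.cross_entropy(logits, labels, reduction="none")
    # ... weighted by the (detached) posterior mass on that class, loosely
    # mimicking a posterior-weighted likelihood term of an ELBO.
    w = posterior.gather(1, labels.unsqueeze(1)).squeeze(1).detach()
    return (w * nll).mean()
```

The temperature `tau` plays the role that the abstract attributes to controlling intra-class dispersion: a smaller value sharpens the posterior and tightens clusters around their prototypes, while a larger value tolerates more spread.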