Self-Supervised Multimodal Learning: A Survey

Multimodal learning, which aims to understand and analyze information from multiple modalities, has achieved substantial progress in the supervised regime in recent years. However, the heavy dependence on data paired with expensive human annotations impedes scaling up models. Meanwhile, given the availability of large-scale unannotated data in the wild, self-supervised learning has become an attractive strategy to alleviate the annotation bottleneck. Building on these two directions, self-supervised multimodal learning (SSML) provides ways to leverage supervision from raw multimodal data. In this survey, we provide a comprehensive review of the state of the art in SSML, which we categorize along three orthogonal axes: objective functions, data alignment, and model architectures. These axes correspond to the inherent characteristics of self-supervised learning methods and multimodal data. Specifically, we classify training objectives into instance discrimination, clustering, and masked prediction categories. We also discuss multimodal input data pairing and alignment strategies during training. We then review model architectures, including the design of encoders, fusion modules, and decoders, which are essential components of SSML methods. We review downstream multimodal application tasks, reporting the concrete performance of state-of-the-art image-text models and multimodal video models, and also review real-world applications of SSML algorithms in diverse fields such as healthcare, remote sensing, and machine translation. Finally, we discuss challenges and future directions for SSML. A collection of related resources can be found at: https://github.com/ys-zong/awesome-self-supervised-multimodal-learning.

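To make the instance discrimination category of training objectives more concrete, here is a minimal sketch of one common form of it: a symmetric InfoNCE-style contrastive loss over paired embeddings from two modalities. This example is an illustration and is not taken from the survey. The function names (`l2_normalize`, `cross_entropy_diagonal`, `symmetric_info_nce`), the temperature value, and the toy embeddings are all assumptions chosen for the example, and real systems would compute this on learned encoder outputs with a tensor library rather than Python lists.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the row maximum for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def symmetric_info_nce(emb_a, emb_b, temperature=0.1):
    """Hypothetical helper: contrastive loss for a batch of paired embeddings.

    emb_a[i] and emb_b[i] are assumed to come from the same underlying sample
    (for example, an image and its caption); every other pairing in the batch
    is treated as a negative.
    """
    a = [l2_normalize(v) for v in emb_a]
    b = [l2_normalize(v) for v in emb_b]
    # logits[i][j]: scaled cosine similarity between item i of A and item j of B
    logits = [[sum(x * y for x, y in zip(u, w)) / temperature for w in b] for u in a]
    logits_t = [list(col) for col in zip(*logits)]  # B-to-A direction
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))

# Toy batch (made-up values): two "image" embeddings and two "text" embeddings.
image_emb = [[1.0, 0.0], [0.0, 1.0]]
text_aligned = [[0.9, 0.1], [0.1, 0.9]]  # pair i matches image i
text_swapped = [[0.1, 0.9], [0.9, 0.1]]  # pairs deliberately mismatched

loss_aligned = symmetric_info_nce(image_emb, text_aligned)
loss_swapped = symmetric_info_nce(image_emb, text_swapped)
print(loss_aligned < loss_swapped)
```

The loss is low when each embedding is most similar to its own partner and high when the pairing is wrong, which is the signal that lets raw paired data act as supervision without human labels.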