Multimodal Co-learning: Challenges, Applications with Datasets, Recent Advances and Future Directions

Multimodal deep learning systems, which employ multiple modalities such as text, image, audio, and video, outperform systems built on a single modality (i.e., unimodal systems). Multimodal machine learning involves several aspects: representation, translation, alignment, fusion, and co-learning. Current multimodal machine learning typically assumes that all modalities are present, aligned, and noiseless at both training and testing time. In real-world tasks, however, one or more modalities are often missing, noisy, short of annotated data, unreliably labeled, or scarce during training, testing, or both. This challenge is addressed by a learning paradigm called multimodal co-learning, in which the modeling of a resource-poor modality is aided by knowledge from a resource-rich modality, transferred between modalities along with their representations and predictive models. Because co-learning is an emerging area, no dedicated review explicitly covers all the challenges it addresses. To that end, we provide a comprehensive survey of multimodal co-learning, which has not yet been explored in its entirety. We also review implementations that overcome one or more co-learning challenges without explicitly treating them as co-learning challenges. We present a comprehensive taxonomy of multimodal co-learning based on the challenges it addresses and the associated implementations. We review the techniques employed, including the latest ones, along with selected applications and datasets. Finally, we discuss challenges and perspectives, together with important ideas and directions for future work, which we hope will benefit the entire research community working in this exciting domain.
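
The transfer of knowledge from a resource-rich modality to a resource-poor one can take many forms. As a minimal sketch of one common form, knowledge distillation, the snippet below computes the loss that pulls a "student" model (resource-poor modality) toward the softened predictions of a "teacher" model (resource-rich modality). The abstract does not prescribe this technique; the function names, the temperature value, the example logits, and the video/audio pairing are all assumptions chosen for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) between temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical logits for one sample over three classes (illustrative values).
teacher_logits = [4.0, 1.0, 0.5]  # assumed resource-rich modality, e.g. video
student_logits = [1.5, 1.2, 0.9]  # assumed resource-poor modality, e.g. audio

loss = distillation_loss(teacher_logits, student_logits)
matched = distillation_loss(teacher_logits, teacher_logits)
print(f"mismatched loss: {loss:.4f}, matched loss: {matched:.4f}")
```

Minimizing this loss over the student's parameters is one way the resource-poor modality could still be modeled when the richer modality is missing at test time; the loss is zero only when the two softened distributions coincide.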
