Multimodal representation learning, a technique for learning to embed information from different modalities and their correlations, has achieved remarkable success in a variety of applications, such as Visual Question Answering (VQA), Natural Language for Visual Reasoning (NLVR), and Vision Language Retrieval (VLR). In these applications, cross-modal interaction and complementary information from different modalities are crucial for advanced models to perform any multimodal task optimally, e.g., understanding, recognition, retrieval, or generation. Researchers have proposed diverse methods to address these tasks; in particular, different variants of transformer-based architectures have performed extraordinarily well across multiple modalities. This survey presents a comprehensive review of the literature on the evolution and enhancement of deep learning multimodal architectures that handle textual, visual, and audio features for diverse cross-modal and modern multimodal tasks. This study summarizes (i) recent task-specific deep learning methodologies, (ii) pretraining types and multimodal pretraining objectives, (iii) state-of-the-art pretrained multimodal approaches and unifying architectures, and (iv) multimodal task categories and possible future improvements that can be devised for better multimodal learning. Moreover, we prepare a dataset section for new researchers that covers most of the benchmarks for pretraining and finetuning. Finally, major challenges, gaps, and potential research topics are explored. A continuously updated paper list related to our survey is maintained at https://github.com/marslanm/multimodality-representation-learning.