Deep learning has been applied to a wide range of applications and has become increasingly popular in recent years. The goal of multimodal deep learning (MMDL) is to create models that can process and link information from multiple modalities. Despite extensive progress in unimodal learning, it still cannot cover all aspects of human learning. Multimodal learning enables better understanding and analysis when various senses are engaged in the processing of information. This paper focuses on multiple types of modalities, i.e., image, video, text, audio, body gestures, facial expressions, and physiological signals. It provides a detailed analysis of baseline approaches and an in-depth study of advances in multimodal deep learning applications over the last five years (2017 to 2021). A fine-grained taxonomy of multimodal deep learning methods is proposed, elaborating on different applications in more depth. Lastly, the main issues are highlighted separately for each domain, along with possible future research directions.