Deep learning has been applied to a wide range of applications and has become increasingly popular in recent years. The goal of multimodal deep learning is to create models that can process and link information across multiple modalities. Despite the extensive progress made in unimodal learning, it still cannot cover all aspects of human learning. Multimodal learning supports better understanding and analysis when multiple senses are engaged in processing information. This paper focuses on multiple types of modalities, i.e., image, video, text, audio, body gestures, facial expressions, and physiological signals. A detailed analysis of past and current baseline approaches and an in-depth study of recent advancements in multimodal deep learning applications are provided. A fine-grained taxonomy of multimodal deep learning applications is proposed, elaborating on the different applications in more depth. The architectures and datasets used in these applications are also discussed, along with their evaluation metrics. Finally, the main issues are highlighted separately for each domain, along with possible future research directions.