现实世界中的信息通常以不同的模态出现。例如,图像通常与标签和文本解释联系在一起;文本包含图像以便更清楚地表达文章的主要思想。不同的模态由迥异的统计特性刻画。例如,图像通常表示为特征提取器的像素强度或输出,而文本则表示为离散的词向量。由于不同信息资源的统计特性不同,发现不同模态之间的关系是非常重要的。多模态学习是一个很好的模型,可以用来表示不同模态的联合表示。多模态学习模型也能在观察到的情况下填补缺失的模态。多模态学习模型中,每个模态对应结合了两个深度玻尔兹曼机(deep boltzmann machines).另外一个隐藏层被放置在两个玻尔兹曼机上层,以给出联合表示。

VIP内容

摘要:大数据是多源异构的。在信息技术飞速发展的今天,多模态数据已成为近来数据资源的主要形式。研究多模态学习方法,赋予计算机理解多源异构海量数据的能力具有重要价值。本文归纳了多模态的定义与多模态学习的基本任务,介绍了多模态学习的认知机理与发展过程。在此基础上,重点综述了多模态统计学习方法与深度学习方法。此外,本文系统归纳了近两年较为新颖的基于对抗学习的跨模态匹配与生成技术。本文总结了多模态学习的主要形式,并对未来可能的研究方向进行思考与展望。

成为VIP会员查看完整内容
0
115

最新内容

Multimodal learning is an emerging yet challenging research area. In this paper, we deal with multimodal sarcasm and humor detection from conversational videos and image-text pairs. Being a fleeting action, which is reflected across the modalities, sarcasm detection is challenging since large datasets are not available for this task in the literature. Therefore, we primarily focus on resource-constrained training, where the number of training samples is limited. To this end, we propose a novel multimodal learning system, MuLOT (Multimodal Learning using Optimal Transport), which utilizes self-attention to exploit intra-modal correspondence and optimal transport for cross-modal correspondence. Finally, the modalities are combined with multimodal attention fusion to capture the inter-dependencies across modalities. We test our approach for multimodal sarcasm and humor detection on three benchmark datasets - MUStARD (video, audio, text), UR-FUNNY (video, audio, text), MST (image, text) and obtain 2.1%, 1.54%, and 2.34% accuracy improvements over state-of-the-art.

0
0
下载
预览

最新论文

Multimodal learning is an emerging yet challenging research area. In this paper, we deal with multimodal sarcasm and humor detection from conversational videos and image-text pairs. Being a fleeting action, which is reflected across the modalities, sarcasm detection is challenging since large datasets are not available for this task in the literature. Therefore, we primarily focus on resource-constrained training, where the number of training samples is limited. To this end, we propose a novel multimodal learning system, MuLOT (Multimodal Learning using Optimal Transport), which utilizes self-attention to exploit intra-modal correspondence and optimal transport for cross-modal correspondence. Finally, the modalities are combined with multimodal attention fusion to capture the inter-dependencies across modalities. We test our approach for multimodal sarcasm and humor detection on three benchmark datasets - MUStARD (video, audio, text), UR-FUNNY (video, audio, text), MST (image, text) and obtain 2.1%, 1.54%, and 2.34% accuracy improvements over state-of-the-art.

0
0
下载
预览
Top