Effective fusion of data from multiple modalities, such as video, speech, and text, is challenging due to the heterogeneous nature of multimodal data. In this paper, we propose adaptive fusion techniques that aim to model context from different modalities effectively. Instead of defining a deterministic fusion operation, such as concatenation, for the network, we let the network decide "how" to combine a given set of multimodal features more effectively. We propose two networks: 1) Auto-Fusion, which learns to compress information from different modalities while preserving the context, and 2) GAN-Fusion, which regularizes the learned latent space given context from complementing modalities. A quantitative evaluation on the tasks of multimodal machine translation and emotion recognition suggests that our lightweight, adaptive networks can better model context from other modalities than existing methods, many of which employ massive transformer-based networks.
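To make the adaptive-fusion idea concrete, below is a minimal sketch of the Auto-Fusion component in PyTorch. It is illustrative only: the layer sizes, activation choices, and feature dimensions are assumptions, not the paper's exact configuration. The key mechanism matches the abstract's description: the concatenated modality features are compressed into a latent code, and a reconstruction loss encourages that code to preserve the original multimodal context.

```python
import torch
import torch.nn as nn

class AutoFusion(nn.Module):
    """Compress concatenated modality features into a fused latent code,
    with a reconstruction loss that encourages context preservation.
    (Illustrative sketch; dimensions and layers are assumptions.)"""
    def __init__(self, input_dim, latent_dim):
        super().__init__()
        self.encode = nn.Sequential(nn.Linear(input_dim, latent_dim), nn.Tanh())
        self.decode = nn.Linear(latent_dim, input_dim)
        self.criterion = nn.MSELoss()

    def forward(self, video_feat, speech_feat, text_feat):
        # Concatenate per-modality features into one vector per example.
        concat = torch.cat([video_feat, speech_feat, text_feat], dim=-1)
        fused = self.encode(concat)            # compressed multimodal code
        recon = self.decode(fused)             # try to recover the input
        loss = self.criterion(recon, concat)   # context-preservation term
        return fused, loss

# Hypothetical usage: the fused code feeds the downstream task (translation
# or emotion recognition), and the reconstruction loss is added to the task
# loss so the network itself decides how to combine the modalities.
fusion = AutoFusion(input_dim=128 + 128 + 300, latent_dim=256)
v, s, t = torch.randn(8, 128), torch.randn(8, 128), torch.randn(8, 300)
fused, recon_loss = fusion(v, s, t)
```

GAN-Fusion follows the same adaptive spirit but, per the abstract, regularizes the fused latent space adversarially using context from the complementing modalities rather than with a reconstruction objective.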