Humans perceive the world by concurrently processing and fusing high-dimensional inputs from multiple modalities such as vision and audio. Machine perception models, in stark contrast, are typically modality-specific and optimised for unimodal benchmarks, and hence late-stage fusion of final representations or predictions from each modality (`late-fusion') is still a dominant paradigm for multimodal video classification. Instead, we introduce a novel transformer-based architecture that uses `fusion bottlenecks' for modality fusion at multiple layers. Compared to traditional pairwise self-attention, our model forces information between different modalities to pass through a small number of bottleneck latents, requiring the model to collate and condense the most relevant information in each modality and to share only what is necessary. We find that such a strategy improves fusion performance while simultaneously reducing computational cost. We conduct thorough ablation studies and achieve state-of-the-art results on multiple audio-visual classification benchmarks, including AudioSet, Epic-Kitchens and VGGSound. All code and models will be released.
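To make the idea concrete, below is a minimal sketch (not the authors' released code) of one bottleneck-fusion layer. The class name, the `n_bottlenecks` parameter, the choice of standard `nn.TransformerEncoderLayer` blocks, and the averaging of the updated bottlenecks across modalities are all illustrative assumptions; the sketch only shows the core constraint that cross-modal information must pass through a small set of shared bottleneck tokens.

```python
# A hedged sketch of bottleneck fusion between RGB and audio token streams.
# Each modality attends over its own tokens plus a few shared bottleneck
# tokens; the bottlenecks are the only route for cross-modal exchange.
import torch
import torch.nn as nn

class BottleneckFusionLayer(nn.Module):
    def __init__(self, dim=768, heads=12, n_bottlenecks=4):
        super().__init__()
        # One transformer layer per modality (weights not shared here).
        self.rgb_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.audio_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.n_bottlenecks = n_bottlenecks

    def forward(self, rgb_tokens, audio_tokens, bottleneck):
        # Concatenate the shared bottleneck tokens onto each modality's sequence
        # so that within-modality self-attention can read from and write to them.
        rgb_out = self.rgb_layer(torch.cat([rgb_tokens, bottleneck], dim=1))
        audio_out = self.audio_layer(torch.cat([audio_tokens, bottleneck], dim=1))
        n = self.n_bottlenecks
        rgb_tokens, rgb_bn = rgb_out[:, :-n], rgb_out[:, -n:]
        audio_tokens, audio_bn = audio_out[:, :-n], audio_out[:, -n:]
        # Fuse the per-modality bottleneck updates (averaging is one simple
        # choice); the fused bottlenecks carry all cross-modal information
        # into the next layer.
        bottleneck = 0.5 * (rgb_bn + audio_bn)
        return rgb_tokens, audio_tokens, bottleneck
```

Because the number of bottleneck tokens is tiny compared with the number of per-modality tokens, attention never has to be computed over the full concatenation of both modalities, which is consistent with the abstract's claim that the bottleneck strategy reduces computational cost relative to pairwise cross-modal self-attention.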