Attention-based models are appealing for multimodal processing because inputs from multiple modalities can be concatenated and fed to a single backbone network, thus requiring very little fusion engineering. The resulting representations, however, are fully entangled throughout the network, which may not always be desirable: in learning, contrastive audio-visual self-supervised learning requires independent audio and visual features to operate, otherwise learning collapses; in inference, evaluation of audio-visual models should be possible on benchmarks containing only audio or only video. In this paper, we introduce Zorro, a technique that uses masks to control how inputs from each modality are routed inside Transformers, keeping some parts of the representation modality-pure. We apply this technique to three popular transformer-based architectures (ViT, Swin and HiP) and show that with contrastive pre-training Zorro achieves state-of-the-art results on the most relevant benchmarks for multimodal tasks (AudioSet and VGGSound). Furthermore, the resulting models are able to perform unimodal inference on both video and audio benchmarks such as Kinetics-400 and ESC-50.
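To make the masking idea concrete, below is a minimal NumPy sketch of what such a modality-routing attention mask could look like, assuming a token layout of [audio | video | fusion]; the helper name `zorro_attention_mask` and the token counts are illustrative assumptions, not taken from the paper. Audio and video queries attend only within their own modality, keeping those streams modality-pure, while fusion queries may attend to all tokens.

```python
import numpy as np

def zorro_attention_mask(n_audio: int, n_video: int, n_fusion: int) -> np.ndarray:
    """Build a binary attention mask for a Zorro-style multimodal Transformer.

    Token layout is assumed to be [audio | video | fusion].
    mask[i, j] == 1 means query token i may attend to key token j.
    """
    n = n_audio + n_video + n_fusion
    mask = np.zeros((n, n), dtype=np.int32)
    a = slice(0, n_audio)                      # audio token block
    v = slice(n_audio, n_audio + n_video)      # video token block
    f = slice(n_audio + n_video, n)            # fusion token block
    mask[a, a] = 1  # audio queries attend only to audio keys
    mask[v, v] = 1  # video queries attend only to video keys
    mask[f, :] = 1  # fusion queries attend to all tokens
    return mask

# Example: 4 audio tokens, 4 video tokens, 2 fusion tokens.
print(zorro_attention_mask(4, 4, 2))
```

Applied inside self-attention (for instance, as an additive mask of minus infinity on the disallowed logits), such a mask guarantees that the unimodal streams never receive information from the other modality, which is what permits both contrastive training on independent audio and visual features and unimodal inference at evaluation time.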