The recent success of Transformers in the language domain has motivated adapting them to a multimodal setting, where a new visual model is trained in tandem with an already pretrained language model. However, due to the excessive memory requirements of Transformers, existing work typically fixes the language model and trains only the vision module, which limits the ability to learn cross-modal information in an end-to-end manner. In this work, we focus on reducing the parameters of multimodal Transformers in the context of audio-visual video representation learning. We alleviate the high memory requirement by sharing the parameters of the Transformers across layers and modalities; we decompose the Transformer into modality-specific and modality-shared parts so that the model learns the dynamics of each modality both individually and jointly, and propose a novel parameter sharing scheme based on low-rank approximation. We show that our approach reduces the parameters of the Transformers by up to 97$\%$, allowing us to train our model end-to-end from scratch. We also propose a negative sampling approach based on instance similarity measured in the CNN embedding space that our model learns jointly with the Transformers. To demonstrate our approach, we pretrain our model on 30-second clips (480 frames) from Kinetics-700 and transfer it to audio-visual classification tasks.
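As a rough illustration of the sharing scheme described above (a schematic sketch only; the symbols $W_m^{(\ell)}$, $W_s$, $U_m$, $V_m$, and the rank $r$ are illustrative notation, not taken from the abstract), the weight matrix of the Transformer for modality $m$ at layer $\ell$ could be tied as
\begin{equation*}
W_m^{(\ell)} \;\approx\; W_s + U_m V_m^{\top}, \qquad W_s \in \mathbb{R}^{d \times d}, \quad U_m, V_m \in \mathbb{R}^{d \times r}, \quad r \ll d,
\end{equation*}
where the modality-shared term $W_s$ is reused across layers and modalities and only the low-rank modality-specific factors $U_m, V_m$ are stored per modality, so the additional per-modality parameter cost scales as $2dr$ rather than $d^2$.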