Multimodal learning pipelines have benefited from the success of pretrained language models. However, this comes at the cost of increased model parameters. In this work, we propose Adapted Multimodal BERT (AMB), a BERT-based architecture for multimodal tasks that uses a combination of adapter modules and intermediate fusion layers. The adapter adjusts the pretrained language model for the task at hand, while the fusion layers perform task-specific, layer-wise fusion of audio-visual information with textual BERT representations. During the adaptation process, the pretrained language model parameters remain frozen, allowing for fast, parameter-efficient training. Our ablations show that this approach leads to efficient models that can outperform their fine-tuned counterparts and are robust to input noise. Our experiments on sentiment analysis with CMU-MOSEI show that AMB outperforms the current state-of-the-art across metrics, with a 3.4% relative reduction in the resulting error and a 2.1% relative improvement in 7-class classification accuracy.
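The following is a minimal PyTorch sketch of the idea described above: each pretrained BERT layer is kept frozen and wrapped with a small trainable adapter and a fusion layer that injects audio-visual features into the textual hidden states. Module names, dimensions, the gated additive fusion mechanism, and the assumed BERT-layer call signature are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn


class Adapter(nn.Module):
    """Bottleneck adapter: down-project, non-linearity, up-project, residual."""

    def __init__(self, hidden_dim: int, bottleneck_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))


class FusionLayer(nn.Module):
    """Fuses audio-visual features into textual hidden states.

    A gated additive fusion is assumed here for illustration.
    """

    def __init__(self, hidden_dim: int, av_dim: int):
        super().__init__()
        self.proj = nn.Linear(av_dim, hidden_dim)
        self.gate = nn.Linear(2 * hidden_dim, hidden_dim)

    def forward(self, text_h: torch.Tensor, av: torch.Tensor) -> torch.Tensor:
        av_h = self.proj(av)  # (batch, seq_len, hidden_dim)
        g = torch.sigmoid(self.gate(torch.cat([text_h, av_h], dim=-1)))
        return text_h + g * av_h


class AdaptedBertLayer(nn.Module):
    """Wraps one frozen pretrained BERT layer with a trainable adapter and fusion layer."""

    def __init__(self, bert_layer: nn.Module, hidden_dim: int, av_dim: int):
        super().__init__()
        self.bert_layer = bert_layer
        for p in self.bert_layer.parameters():  # keep pretrained weights frozen
            p.requires_grad = False
        self.adapter = Adapter(hidden_dim)
        self.fusion = FusionLayer(hidden_dim, av_dim)

    def forward(self, hidden, attention_mask, av_features):
        # Assumed call signature: the wrapped layer returns a tuple whose first
        # element is the updated hidden states (as in typical BERT implementations).
        hidden = self.bert_layer(hidden, attention_mask)[0]
        hidden = self.adapter(hidden)          # task adaptation of the frozen layer's output
        hidden = self.fusion(hidden, av_features)  # layer-wise multimodal fusion
        return hidden
```

Because only the adapter and fusion parameters receive gradients, the number of trainable parameters per layer is small compared to full fine-tuning, which is what makes the adaptation fast and parameter-efficient.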