Transformer-based models are widely used in natural language understanding (NLU) tasks, and multimodal transformers have been effective in visual-language tasks. This study explores distilling visual information from pretrained multimodal transformers into pretrained language encoders. Our framework is inspired by the success of cross-modal encoders in visual-language tasks, while we alter the learning objective to cater to the language-heavy characteristics of NLU. After a small number of extra adaptation steps and finetuning, the proposed XDBERT (cross-modal distilled BERT) outperforms pretrained BERT on the General Language Understanding Evaluation (GLUE) benchmark, the Situations With Adversarial Generations (SWAG) benchmark, and readability benchmarks. We analyze the performance of XDBERT on GLUE to show that the improvement is likely visually grounded.