Current language models have been criticised for learning language from text alone, without any connection between words and their meaning. Consequently, multimodal training has been proposed as a way to create models with better language understanding by providing the missing connection. We focus on pre-trained multimodal vision-and-language (VL) models, for which some results on their language understanding capabilities already exist. An unresolved issue with evaluating the linguistic skills of these models, however, is that there is no established method for adapting them to text-only input without out-of-distribution uncertainty. To find the best approach, we investigate and compare seven possible methods for adapting three different pre-trained VL models to text-only input. Our evaluations on both GLUE and Visual Property Norms (VPN) show that care should be taken when adapting VL models to zero-shot text-only tasks, while the models are less sensitive to how we adapt them to non-zero-shot tasks. We also find that the adaptation methods perform differently for the different models, and that unimodal model counterparts perform on par with the VL models regardless of adaptation, indicating that current VL models do not necessarily gain better language understanding from their multimodal training.