Since visual perception can provide rich information about the world beyond what text descriptions capture, there has been increasing interest in leveraging visual grounding for language learning. Recently, vokenization (Tan and Bansal, 2020) has attracted attention by using the predictions of a text-to-image retrieval model as labels for language model supervision. Despite its success, the method suffers from the approximation error of using finite image labels and the limited vocabulary diversity of a small image-text dataset. To overcome these limitations, we present VidLanKD, a video-language knowledge distillation method for improving language understanding. We train a multi-modal teacher model on a video-text dataset, and then transfer its knowledge to a student language model using a text dataset. To avoid the approximation error, we propose to use different knowledge distillation objectives. In addition, a large-scale video-text dataset helps the model learn a more diverse and richer vocabulary. In our experiments, VidLanKD achieves consistent improvements over text-only language models and vokenization models on several downstream language understanding tasks, including GLUE, SQuAD, and SWAG. We also demonstrate the improved world knowledge, physical reasoning, and temporal reasoning capabilities of our model by evaluating on the GLUE-diagnostics, PIQA, and TRACIE datasets. Lastly, we present comprehensive ablation studies as well as visualizations of the learned text-to-video grounding results of our teacher and student language models. Our code and models are available at: https://github.com/zinengtang/VidLanKD
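To make the distillation setup concrete, below is a minimal PyTorch sketch contrasting the two kinds of supervision the abstract mentions: a vokenization-style soft-label objective, where the student matches a finite label distribution, and a feature-level objective, where the student's hidden states are aligned with the teacher's directly, sidestepping the label-approximation error. The function names, the MSE feature loss, and the toy tensors are illustrative assumptions for exposition, not the paper's exact objectives.

```python
import torch
import torch.nn.functional as F

def soft_label_kd_loss(student_logits: torch.Tensor,
                       teacher_logits: torch.Tensor,
                       temperature: float = 2.0) -> torch.Tensor:
    """Soft-label KD (Hinton et al., 2015): the student matches the teacher's
    softened output distribution via KL divergence. Vokenization's finite
    image labels can be viewed as a coarse version of this signal."""
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    # Scale by t^2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * (t * t)

def feature_kd_loss(student_hidden: torch.Tensor,
                    teacher_hidden: torch.Tensor) -> torch.Tensor:
    """Feature-based KD: align hidden representations directly instead of
    matching a finite label set. MSE is used here purely for illustration."""
    return F.mse_loss(student_hidden, teacher_hidden)

# Toy usage: a batch of 8 tokens, vocabulary of 100, hidden size 32.
student_logits, teacher_logits = torch.randn(8, 100), torch.randn(8, 100)
student_h, teacher_h = torch.randn(8, 32), torch.randn(8, 32)
loss = soft_label_kd_loss(student_logits, teacher_logits) \
     + feature_kd_loss(student_h, teacher_h)
```

In this framing, the teacher is first trained on video-text data and frozen; only the student's text encoder receives gradients from the combined loss when training on the text-only corpus.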