Since visual perception can provide rich information beyond text descriptions for understanding the world, there has been increasing interest in leveraging visual grounding for language learning. Recently, vokenization has attracted attention by using the predictions of a text-to-image retrieval model as labels for language model supervision. Despite its success, the method suffers from the approximation error of using finite image labels and from the limited vocabulary diversity of a small image-text dataset. To overcome these limitations, we present VidLanKD, a video-language knowledge distillation method for improving language understanding. We train a multi-modal teacher model on a video-text dataset, and then transfer its knowledge to a student language model using a text dataset. To avoid the approximation error, we propose to use different knowledge distillation objectives. In addition, the use of a large-scale video-text dataset helps the model learn a more diverse and richer vocabulary. In our experiments, VidLanKD achieves consistent improvements over text-only language models and vokenization models on several downstream language understanding tasks, including GLUE, SQuAD, and SWAG. We also demonstrate the improved world knowledge, physical reasoning, and temporal reasoning capabilities of our model by evaluating on the GLUE-diagnostics, PIQA, and TRACIE datasets. Lastly, we present comprehensive ablation studies as well as visualizations of the learned text-to-video grounding results of our teacher and student language models. Our code and models are available at: https://github.com/zinengtang/VidLanKD
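To make the teacher-to-student transfer described above more concrete, the following is a minimal PyTorch sketch of a generic feature-level distillation step: a frozen multi-modal teacher encodes the same text as the student, and the student is trained with its usual masked-language-modeling loss plus a soft feature-matching term, rather than hard image-retrieval labels. The function and argument names (`student`, `teacher`, `proj`, `mlm_loss_fn`, `alpha`) are hypothetical, and plain MSE matching stands in for the paper's actual distillation objectives.

```python
# Minimal sketch of soft, feature-level knowledge distillation from a
# frozen video-language teacher to a text-only student (assumed names;
# not the paper's exact objectives).
import torch
import torch.nn.functional as F


def feature_distillation_loss(student_hidden, teacher_hidden):
    """MSE between projected student and teacher token representations.

    Both tensors have shape (batch, seq_len, hidden_dim).
    """
    return F.mse_loss(student_hidden, teacher_hidden)


def distillation_step(student, teacher, proj, mlm_loss_fn, batch, alpha=1.0):
    """One training step combining MLM with a soft distillation term."""
    # The teacher is frozen during distillation; only its text encoder
    # (trained jointly with video) is queried for target representations.
    with torch.no_grad():
        t_hidden = teacher(batch["input_ids"], batch["attention_mask"])

    s_hidden = student(batch["input_ids"], batch["attention_mask"])

    # Standard masked-language-modeling loss on the text corpus, plus the
    # soft feature-matching term that transfers the teacher's knowledge.
    loss = mlm_loss_fn(s_hidden, batch["labels"])
    loss = loss + alpha * feature_distillation_loss(proj(s_hidden), t_hidden)
    return loss
```

Using a soft, representation-level objective like this (instead of a fixed set of discrete image labels) is what lets the student be supervised over an unrestricted text vocabulary while still benefiting from the teacher's visually grounded training.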