Humans learn language by listening, speaking, writing, reading, and also via interaction with the multimodal real world. Existing language pre-training frameworks show the effectiveness of text-only self-supervision, while we explore the idea of a visually-supervised language model in this paper. We find that the main reason hindering this exploration is the large divergence in magnitude and distributions between the visually-grounded language datasets and pure-language corpora. Therefore, we develop a technique named "vokenization" that extrapolates multimodal alignments to language-only data by contextually mapping language tokens to their related images (which we call "vokens"). The "vokenizer" is trained on relatively small image captioning datasets, and we then apply it to generate vokens for large language corpora. Trained with these contextually generated vokens, our visually-supervised language models show consistent improvements over self-supervised alternatives on multiple pure-language tasks such as GLUE, SQuAD, and SWAG. Code and pre-trained models are publicly available at https://github.com/airsplay/vokenization
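To make the token-to-image mapping concrete, below is a minimal sketch of the retrieval step that vokenization performs: each contextualized token embedding is matched against a bank of image embeddings and assigned the index of its most relevant image as its voken. This is an illustrative assumption-laden example, not the released implementation; all names (`assign_vokens`, `contextual_token_embs`, `image_embs`) are hypothetical, and in the actual system the two embedding spaces come from a language encoder and a visual encoder trained for relevance on image captioning data.

```python
# Hypothetical sketch of voken assignment by maximum similarity retrieval.
# Not the authors' API; encoders and names are placeholders.
import torch
import torch.nn.functional as F

def assign_vokens(contextual_token_embs: torch.Tensor,
                  image_embs: torch.Tensor) -> torch.Tensor:
    """Map each contextualized token to the index of its most relevant
    image ("voken") via maximum cosine similarity.

    contextual_token_embs: (num_tokens, dim)  from a language encoder
    image_embs:            (num_images, dim)  from a visual encoder
    returns:               (num_tokens,) tensor of image indices
    """
    tok = F.normalize(contextual_token_embs, dim=-1)
    img = F.normalize(image_embs, dim=-1)
    relevance = tok @ img.t()          # (num_tokens, num_images) similarity
    return relevance.argmax(dim=-1)    # nearest image per token

# Toy usage: 5 tokens in context, a bank of 100 candidate images, 256-d space.
tokens = torch.randn(5, 256)
images = torch.randn(100, 256)
vokens = assign_vokens(tokens, images)  # e.g. tensor([17,  3, 88,  3, 42])
```

The retrieved voken indices can then serve as extra supervision targets (e.g., a per-token voken classification objective alongside masked language modeling) when pre-training on large text-only corpora.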