Vision-and-language tasks are gaining popularity in the research community, but the focus is still mainly on English. We propose a pipeline that uses English-only vision-language models to train a monolingual model for a target language. Specifically, we extend OSCAR+, a model that leverages object tags as anchor points for learning image-text alignments, to train on visual question answering datasets in different languages, and we introduce a novel knowledge distillation approach that trains the model in other languages using parallel sentences. Compared to other models that include the target language in their pretraining corpora, our approach transfers knowledge from an existing English model to the target language using significantly fewer resources. We also release a large-scale visual question answering dataset in Japanese and Hindi. Although we restrict our work to visual question answering, our model can be extended to any sequence-level classification task and to other languages. This paper focuses on two languages for the visual question answering task: Japanese and Hindi. Our pipeline outperforms the current state-of-the-art models with relative accuracy gains of 4.4% and 13.4%, respectively.
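To make the idea of distilling over parallel sentences more concrete, the following is a minimal sketch and not the method proposed in this paper, whose objective is not specified in the abstract. It assumes a generic temperature-scaled distillation loss: an English teacher scores the candidate answers for an English question, a target-language student scores the same candidates for the parallel question, and the student is penalized by the KL divergence between the two softened distributions. The function names, the temperature value, and the logits are all hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert raw scores into a probability distribution (numerically stable)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over softened answer distributions.

    Assumption: a generic temperature-scaled distillation loss, used here
    only for illustration; it is not taken from the paper.
    """
    p = softmax(teacher_logits, temperature)  # English teacher
    q = softmax(student_logits, temperature)  # target-language student
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    # Scaling by T^2 keeps the loss magnitude comparable across temperatures.
    return kl * temperature ** 2


# Hypothetical answer logits for one parallel question pair.
teacher = [2.0, 0.5, -1.0]        # teacher scores for the English question
student_close = [1.9, 0.6, -1.1]  # student roughly agrees with the teacher
student_far = [-1.0, 0.5, 2.0]    # student ranks the answers in reverse

print(distillation_loss(teacher, student_close))
print(distillation_loss(teacher, student_far))
```

In this sketch the loss is zero when the student reproduces the teacher's distribution and grows as the two diverge, which is the signal that would push a target-language model toward the English model's predictions on parallel inputs.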