This paper is a technical report sharing our experience and findings from building a bilingual Korean-English multimodal model. Most multimodal datasets focus on English, and multilingual multimodal research often relies on machine-translated texts; however, such translations are limited in their ability to describe unique expressions, cultural context, and proper nouns in languages other than English. In this work, we collect 1.1 billion image-text pairs (708 million Korean and 476 million English) and train a bilingual multimodal model named KELIP. We introduce simple yet effective training schemes, including MAE pre-training and multi-crop augmentation. Extensive experiments demonstrate that a model trained with these schemes achieves competitive performance in both languages. Moreover, we discuss multimodal-related research questions: 1) strong augmentation-based methods can distract the model from learning proper multimodal relations; 2) a multimodal model trained without explicit cross-lingual supervision can still learn cross-lingual relations through shared visual semantics; 3) our bilingual KELIP captures cultural differences in the visual semantics of words with the same meaning; 4) a large-scale multimodal model can be used for multimodal feature analogy. We hope this work provides helpful experience and findings for future research. We provide an open-source pre-trained KELIP.
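As a reading aid for the contrastive pre-training the abstract refers to, below is a minimal sketch of the standard symmetric image-text contrastive (InfoNCE) loss used by CLIP-family models. The function name, the temperature value, and the assumption that Korean and English captions share a single text encoder are illustrative placeholders, not details taken from the paper.

```python
# Minimal sketch of a CLIP-style symmetric contrastive objective.
# Hypothetical helper for illustration; KELIP's actual loss, encoders,
# and hyperparameters may differ.
import torch
import torch.nn.functional as F

def contrastive_loss(image_features, text_features, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matched image-text pairs.

    Under the sketch's assumption that both Korean and English captions
    pass through the same text encoder, a single batch can mix languages.
    """
    # L2-normalize so dot products become cosine similarities.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Pairwise similarity logits; diagonal entries are the true pairs.
    logits = image_features @ text_features.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Average the image-to-text and text-to-image cross-entropy terms.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```

With this kind of objective, no explicit Korean-English pairing is ever supplied; any cross-lingual alignment emerges only because captions in both languages are pulled toward the same image embeddings, which is consistent with research question 2 above.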