We present M3P, a Multitask Multilingual Multimodal Pre-trained model that combines multilingual pre-training and multimodal pre-training into a unified framework via multitask pre-training. Our goal is to learn universal representations that can map objects that occur in different modalities, or texts expressed in different languages, into a common semantic space. In addition, to explicitly encourage fine-grained alignment between images and non-English languages, we also propose Multimodal Code-switched Training (MCT), which combines monolingual pre-training and multimodal pre-training via a code-switch strategy. Experiments are performed on the multilingual image retrieval task on two benchmark datasets, MSCOCO and Multi30K. M3P achieves comparable results for English and new state-of-the-art results for non-English languages.
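To make the code-switch strategy concrete, the sketch below shows one common instantiation: randomly substituting words in an English caption with their translations from a bilingual dictionary, so that the paired image is trained against mixed-language text. This is a minimal illustration under stated assumptions, not M3P's released implementation; the toy dictionary entries, the `code_switch` helper, and the replacement probability `p` are all hypothetical.

```python
import random

# Toy MUSE-style bilingual dictionaries (hypothetical entries); real
# pre-training would load large word-translation tables per language.
BILINGUAL_DICT = {
    "de": {"dog": "Hund", "running": "rennt", "park": "Park"},
    "fr": {"dog": "chien", "park": "parc"},
}

def code_switch(tokens, lang, p=0.5, rng=random):
    """Replace each translatable token with probability p, producing a
    code-switched caption that mixes English with `lang`."""
    table = BILINGUAL_DICT.get(lang, {})
    return [
        table[tok] if tok in table and rng.random() < p else tok
        for tok in tokens
    ]

caption = "a dog running in the park".split()
print(code_switch(caption, "de"))
# Possible output: ['a', 'Hund', 'running', 'in', 'the', 'Park']
```

During multitask pre-training, such code-switched captions can be paired with the original image, so that non-English words receive direct visual grounding.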