In this paper, we propose an Omni-perception Pre-Trainer (OPT) for cross-modal understanding and generation, which jointly models visual, text, and audio resources. OPT is constructed in an encoder-decoder framework, comprising three single-modal encoders that generate token-based embeddings for each modality, a cross-modal encoder that encodes the correlations among the three modalities, and two cross-modal decoders that generate text and images, respectively. For OPT's pre-training, we design a multi-task pretext learning scheme that models multi-modal resources at three data granularities, \ie, token-, modality-, and sample-level modeling, through which OPT learns to align and translate among different modalities. Pre-training is carried out on a large number of image-text-audio triplets from Open Images. Experimental results show that OPT learns strong image-text-audio multi-modal representations and achieves promising results on a variety of cross-modal understanding and generation tasks.
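To make the described encoder-decoder layout concrete, below is a minimal, hypothetical PyTorch-style sketch of the forward pass: three single-modal encoders, one cross-modal encoder over the fused token sequence, and two decoders for text and discrete image codes. All module names, layer counts, dimensions, and the concatenation-based fusion are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of the OPT encoder-decoder layout (module names, sizes,
# and the fusion strategy are illustrative assumptions, not the paper's code).
import torch
import torch.nn as nn


class OPTSketch(nn.Module):
    def __init__(self, d_model=512, vocab_size=30522, image_codebook=8192):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        # Three single-modal encoders produce token-level embeddings for
        # image regions, text tokens, and audio frames.
        self.vision_enc = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.text_enc = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.audio_enc = nn.TransformerEncoder(enc_layer, num_layers=2)
        # One cross-modal encoder models correlations among the three
        # modalities over the concatenated token sequence.
        self.cross_enc = nn.TransformerEncoder(enc_layer, num_layers=2)
        # Two cross-modal decoders generate text tokens and discrete
        # image codes, respectively.
        self.text_dec = nn.TransformerDecoder(dec_layer, num_layers=2)
        self.image_dec = nn.TransformerDecoder(dec_layer, num_layers=2)
        self.text_head = nn.Linear(d_model, vocab_size)
        self.image_head = nn.Linear(d_model, image_codebook)

    def forward(self, img_tokens, txt_tokens, aud_tokens, txt_queries, img_queries):
        # Single-modal encoding (inputs are assumed already embedded to d_model).
        v = self.vision_enc(img_tokens)
        t = self.text_enc(txt_tokens)
        a = self.audio_enc(aud_tokens)
        # Cross-modal fusion over the concatenated token sequence.
        fused = self.cross_enc(torch.cat([v, t, a], dim=1))
        # Decode text and image conditioned on the fused representation.
        txt_logits = self.text_head(self.text_dec(txt_queries, fused))
        img_logits = self.image_head(self.image_dec(img_queries, fused))
        return txt_logits, img_logits


if __name__ == "__main__":
    model = OPTSketch()
    B, d = 2, 512
    img = torch.randn(B, 36, d)    # e.g. 36 image-region features
    txt = torch.randn(B, 20, d)    # 20 text token embeddings
    aud = torch.randn(B, 50, d)    # 50 audio frame embeddings
    txt_q = torch.randn(B, 20, d)  # text decoder queries
    img_q = torch.randn(B, 64, d)  # image-code decoder queries
    text_logits, image_logits = model(img, txt, aud, txt_q, img_q)
    print(text_logits.shape, image_logits.shape)
```

In this sketch the token-, modality-, and sample-level pretext losses would be attached on top of the single-modal and fused representations; they are omitted here to keep the structural outline short.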