Vision-language pre-training has been an emerging and fast-developing research topic, which transfers multi-modal knowledge from rich-resource pre-training tasks to limited-resource downstream tasks. Unlike existing works that predominantly learn a single generic encoder, we present a pre-trainable Universal Encoder-DEcoder Network (Uni-EDEN) to facilitate both vision-language perception (e.g., visual question answering) and generation (e.g., image captioning). Uni-EDEN is a two-stream Transformer-based structure consisting of three modules: object and sentence encoders that separately learn the representations of each modality, and a sentence decoder that enables both multi-modal reasoning and sentence generation via inter-modal interaction. Considering that the linguistic representations of each image can span a hierarchy of granularities, from simple to comprehensive: an individual label, a phrase, and a natural sentence, we pre-train Uni-EDEN through multi-granular vision-language proxy tasks: Masked Object Classification (MOC), Masked Region Phrase Generation (MRPG), Image-Sentence Matching (ISM), and Masked Sentence Generation (MSG). In this way, Uni-EDEN is endowed with the power of both multi-modal representation extraction and language modeling. Extensive experiments demonstrate the compelling generalizability of Uni-EDEN by fine-tuning it on four vision-language perception and generation downstream tasks.
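For concreteness, the sketch below shows one way such a two-stream encoder-decoder could be wired, assuming PyTorch. All module names, dimensions, and the first-position pooling used for the ISM head are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class UniEDENSketch(nn.Module):
    """Minimal two-stream layout: object encoder + sentence encoder + sentence decoder."""
    def __init__(self, vocab_size=30522, region_dim=2048, d_model=768,
                 n_heads=12, n_layers=6):
        super().__init__()
        self.region_proj = nn.Linear(region_dim, d_model)   # map region features to d_model
        self.token_embed = nn.Embedding(vocab_size, d_model)
        # Two single-modal encoders, one per stream.
        self.object_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), n_layers)
        self.sentence_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), n_layers)
        # Sentence decoder: cross-attends over the concatenated multi-modal memory,
        # serving both multi-modal reasoning and autoregressive sentence generation.
        self.sentence_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True), n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)        # word logits for MSG / MRPG
        self.match_head = nn.Linear(d_model, 2)              # matched vs. mismatched for ISM

    def forward(self, regions, tokens, target_tokens):
        obj = self.object_encoder(self.region_proj(regions))     # (B, R, d_model)
        sent = self.sentence_encoder(self.token_embed(tokens))   # (B, T, d_model)
        memory = torch.cat([obj, sent], dim=1)                   # inter-modal interaction memory
        tgt = self.token_embed(target_tokens)
        T = target_tokens.size(1)
        # Causal mask so each position attends only to earlier tokens during generation.
        causal = torch.triu(torch.full((T, T), float('-inf')), diagonal=1)
        dec = self.sentence_decoder(tgt, memory, tgt_mask=causal)
        # First decoder position pooled for ISM here is an illustrative choice.
        return self.lm_head(dec), self.match_head(dec[:, 0])

model = UniEDENSketch()
regions = torch.randn(2, 36, 2048)           # e.g., 36 detected region features per image
tokens = torch.randint(0, 30522, (2, 20))    # paired (possibly masked) sentence tokens
gen_logits, ism_logits = model(regions, tokens, tokens)
```

Under this reading, MOC and ISM exercise the representation-extraction side (encoders plus decoder as a reasoner), while MRPG and MSG exercise the language-modeling side through the same decoder with a causal mask.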