L-Verse: Bidirectional Generation Between Image and Text

Far beyond learning long-range interactions of natural language, transformers are becoming the de facto standard for many vision tasks thanks to their power and scalability. Especially in cross-modal tasks between image and text, vector quantized variational autoencoders (VQ-VAEs) are widely used to convert a raw RGB image into a sequence of feature vectors. To better leverage the correlation between image and text, we propose L-Verse, a novel architecture consisting of a feature-augmented variational autoencoder (AugVAE) and a bidirectional auto-regressive transformer (BiART) for text-to-image and image-to-text generation. Our AugVAE achieves state-of-the-art reconstruction performance on the ImageNet1K validation set, along with robustness to unseen images in the wild. Unlike other models, BiART can distinguish between an image (or text) used as a conditional reference and one used as a generation target. L-Verse can be used directly for image-to-text or text-to-image generation tasks without any finetuning or extra object detection frameworks. In quantitative and qualitative experiments, L-Verse shows impressive results against previous methods in both image-to-text and text-to-image generation on MS-COCO Captions. We furthermore assess the scalability of the L-Verse architecture on Conceptual Captions and present initial results of bidirectional vision-language representation learning on the general domain.
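To make the quantization step mentioned in the abstract more concrete, here is a minimal sketch of the nearest-codebook lookup that VQ-VAE-style models use to turn continuous feature vectors into discrete tokens. It is not the AugVAE from the paper: the function names `quantize` and `dequantize`, the 2-dimensional features, and the three-entry codebook are all assumptions made for illustration, and a real model would learn the codebook and produce the features with a convolutional encoder.

```python
def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codebook entry."""
    indices = []
    for vec in features:
        # Squared Euclidean distance from this vector to every codebook entry.
        dists = [sum((a - b) ** 2 for a, b in zip(vec, code)) for code in codebook]
        indices.append(dists.index(min(dists)))
    return indices


def dequantize(indices, codebook):
    """Replace each token index with its codebook vector (input to a decoder)."""
    return [codebook[i] for i in indices]


# Hypothetical toy values: a 3-entry codebook and 4 encoder output vectors.
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
features = [[0.1, -0.1], [0.9, 1.2], [0.2, 0.8], [0.6, 0.9]]

tokens = quantize(features, codebook)        # discrete token sequence
quantized = dequantize(tokens, codebook)     # vectors a decoder would consume
print(tokens)
```

The resulting index sequence is the kind of discrete representation an auto-regressive transformer can model alongside text tokens.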


Related content

Autoencoder

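An autoencoder is a type of artificial neural network used to learn efficient data encodings in an unsupervised manner. Its aim is to learn a representation (encoding) for a set of data, typically for dimensionality reduction, by training the network to ignore signal "noise". Alongside the reduction side, a reconstruction side is learned, in which the autoencoder tries to generate, from the reduced encoding, a representation as close as possible to its original input, hence its name. Several variants of the basic model exist, each aiming to force the learned representation of the input to take on useful properties. Autoencoders are applied effectively to many problems, from facial recognition to acquiring the semantic meaning of words.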
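The encode-then-reconstruct idea described above can be sketched with a very small linear autoencoder. This is an illustrative toy under stated assumptions, not a standard library API: the tied-weight design, the names `encode`, `decode`, `loss`, and `train`, the learning rate, and the four 2-D data points are all chosen for the example. The data lie on a single line, so a 1-number bottleneck is enough to reconstruct them.

```python
def encode(w, x):
    """Compress a vector x to a single number (the bottleneck code)."""
    return sum(wi * xi for wi, xi in zip(w, x))


def decode(w, z):
    """Reconstruct a vector from the code z, reusing the encoder weights."""
    return [z * wi for wi in w]


def loss(w, data):
    """Mean squared reconstruction error over the data set."""
    total = 0.0
    for x in data:
        x_hat = decode(w, encode(w, x))
        total += sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return total / len(data)


def train(data, w, lr=0.02, steps=2000):
    """Plain gradient descent on the reconstruction error."""
    w = list(w)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for x in data:
            z = encode(w, x)
            r = [a - b for a, b in zip(x, decode(w, z))]  # residual x - x_hat
            rw = sum(ri * wi for ri, wi in zip(r, w))
            for k in range(len(w)):
                # Derivative of ||x - (w.x) w||^2 with respect to w[k].
                grad[k] += -2.0 * (rw * x[k] + z * r[k])
        w = [wi - lr * g / len(data) for wi, g in zip(w, grad)]
    return w


# Hypothetical 2-D data lying on the line y = 2x.
data = [[s, 2.0 * s] for s in (-1.0, -0.5, 0.5, 1.0)]
initial_w = [0.5, 0.5]
trained_w = train(data, initial_w)
print(loss(initial_w, data), loss(trained_w, data))
```

After training, the weight vector aligns with the direction of the data, so each 2-D point is stored as one number and recovered almost exactly. Practical autoencoders add nonlinear layers and untied decoder weights.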
