Localized Narratives is a dataset with detailed natural language descriptions of images paired with mouse traces that provide a sparse, fine-grained visual grounding for phrases. We propose TReCS, a sequential model that exploits this grounding to generate images. TReCS uses descriptions to retrieve segmentation masks and predict object labels aligned with mouse traces. These alignments are used to select and position masks to generate a fully covered segmentation canvas; the final image is produced by a segmentation-to-image generator using this canvas. This multi-step, retrieval-based approach outperforms existing direct text-to-image generation models on both automatic metrics and human evaluations: overall, its generated images are more photo-realistic and better match descriptions.
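To make the multi-step pipeline concrete, the following is a minimal runnable sketch of the TReCS flow under simplifying assumptions: traces are lists of normalized (x, y) points, the canvas resolution is fixed, and label prediction, mask retrieval, and the segmentation-to-image generator are stubbed placeholders rather than the authors' implementation.

```python
"""Illustrative sketch of a TReCS-style pipeline (not the authors' code).

Assumptions: trace segments are lists of normalized (x, y) points; the
model components (label alignment, mask retrieval, image generation)
are replaced with trivial placeholders to keep the sketch runnable.
"""
import numpy as np

H, W = 64, 64  # canvas resolution (assumption for this sketch)

def align_labels_to_trace(description, trace_segments):
    # Placeholder: the real model predicts an object label per phrase,
    # aligned with the trace segment drawn while the phrase was spoken.
    return [("sky", trace_segments[0]), ("tree", trace_segments[1])]

def retrieve_mask(label):
    # Placeholder: the real system retrieves a segmentation mask that
    # matches the predicted label from an indexed set of masks.
    return np.ones((16, 16), dtype=bool)

def compose_canvas(aligned):
    # Paste each retrieved mask at the centroid of its trace segment,
    # later objects over earlier ones, yielding a label-index canvas.
    canvas = np.zeros((H, W), dtype=np.int32)
    for idx, (label, segment) in enumerate(aligned, start=1):
        mask = retrieve_mask(label)
        cy = int(np.mean([p[1] for p in segment]) * H)
        cx = int(np.mean([p[0] for p in segment]) * W)
        y0, x0 = max(cy - 8, 0), max(cx - 8, 0)
        region = canvas[y0:y0 + 16, x0:x0 + 16]
        region[mask[:region.shape[0], :region.shape[1]]] = idx
    return canvas

def seg_to_image(canvas):
    # Placeholder for the segmentation-to-image generator that renders
    # the final photo-realistic image conditioned on the canvas.
    return canvas  # stand-in output

trace_segments = [
    [(0.2, 0.1), (0.5, 0.2)],  # normalized (x, y) points per segment
    [(0.7, 0.6), (0.8, 0.8)],
]
image = seg_to_image(compose_canvas(
    align_labels_to_trace("sky above a tree", trace_segments)))
```

The sketch mirrors the four stages named in the abstract: word-to-trace label alignment, mask retrieval, canvas composition, and segmentation-to-image rendering; each stub would be replaced by the corresponding learned component in the actual system.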