We introduce a new task, Interactive Image Editing via conversational language, in which users guide an agent to edit images through multi-turn dialogue. In each dialogue turn, the agent takes a source image and a natural language description as input, and generates a modified image that follows the textual description. We introduce two new datasets for this task, Zap-Seq and DeepFashion-Seq, which contain multi-turn dialogue sessions with crowdsourced image-description sequences. The main challenges in this sequential and interactive image generation task are two-fold: 1) contextual consistency between a generated image and the given textual description; and 2) step-by-step region-level modification to maintain visual consistency across the image sequence. To address these challenges, we propose a novel Sequential Attention Generative Adversarial Network (SeqAttnGAN) framework, which applies a neural state tracker to encode the previous image and the textual description at each dialogue turn, and uses a GAN framework to generate a modified version of the image that is consistent with the dialogue context and the preceding images. To achieve better region-specific refinement, we further introduce a sequential attention mechanism into the model. Experiments on the Zap-Seq and DeepFashion-Seq datasets show that the proposed SeqAttnGAN model outperforms state-of-the-art approaches on the interactive image editing task across all evaluation metrics, covering visual quality, image sequence coherence, and text-image consistency.
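To make the per-turn data flow concrete, the following PyTorch sketch shows one way the two components named above, a neural state tracker over the previous image and the current description, and a sequential attention module that injects word-level context into spatial image features, could be wired together. All module names, dimensions, and the GRU-based realization of the tracker are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of one SeqAttnGAN dialogue turn (illustrative, not official code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class NeuralStateTracker(nn.Module):
    """Fuses previous-image features with the current text description into a
    dialogue state. A GRU cell is one plausible choice; the exact recurrence
    used in the paper is an assumption here."""
    def __init__(self, img_dim=256, txt_dim=256, state_dim=256):
        super().__init__()
        self.gru = nn.GRUCell(img_dim + txt_dim, state_dim)

    def forward(self, img_feat, txt_feat, prev_state):
        # img_feat: (B, img_dim), txt_feat: (B, txt_dim), prev_state: (B, state_dim)
        return self.gru(torch.cat([img_feat, txt_feat], dim=-1), prev_state)

class SequentialAttention(nn.Module):
    """Word-level attention over the description, queried by each spatial
    location of the image feature map, yielding region-specific context."""
    def __init__(self, word_dim=256, feat_dim=256):
        super().__init__()
        self.proj = nn.Linear(word_dim, feat_dim)

    def forward(self, feat_map, word_embs):
        # feat_map: (B, C, H, W); word_embs: (B, T, word_dim)
        B, C, H, W = feat_map.shape
        q = feat_map.flatten(2).transpose(1, 2)           # (B, HW, C) spatial queries
        k = self.proj(word_embs)                          # (B, T, C) word keys/values
        attn = F.softmax(q @ k.transpose(1, 2), dim=-1)   # (B, HW, T) per-region word weights
        ctx = (attn @ k).transpose(1, 2).view(B, C, H, W) # attended word context per region
        return torch.cat([feat_map, ctx], dim=1)          # fused features for the generator

# One turn: track dialogue state, attend to the description, then feed the
# fused features to a GAN generator (generator/discriminator omitted).
tracker, attn = NeuralStateTracker(), SequentialAttention()
state = tracker(torch.randn(2, 256), torch.randn(2, 256), torch.zeros(2, 256))
fused = attn(torch.randn(2, 256, 16, 16), torch.randn(2, 7, 256))
```

In this sketch the tracker carries dialogue history across turns, while the attention module is what enables the step-by-step region-level modification the task requires: each spatial location draws on the description words most relevant to it, so edits can stay local without disturbing the rest of the image.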