This paper proposes a method for generating images of customized objects specified by users. The method is built on a general framework that bypasses the lengthy optimization required by previous approaches, which typically follow a per-object optimization paradigm. Our framework adopts an encoder to capture high-level identifiable semantics of objects, producing an object-specific embedding in a single feed-forward pass. The acquired object embedding is then passed to a text-to-image synthesis model for subsequent generation. To effectively blend the object-aware embedding space into a well-developed text-to-image model under the same generation context, we investigate different network designs and training strategies, and propose a simple yet effective regularized joint training scheme with an object identity preservation loss. Additionally, we propose a caption generation scheme that proves critical to faithfully reflecting the object-specific embedding in the generation process while retaining control and editing abilities. Once trained, the network is able to produce diverse content and styles, conditioned on both texts and objects. We demonstrate through experiments that our method synthesizes images with compelling output quality, appearance diversity, and object fidelity, without the need for test-time optimization. Systematic studies are also conducted to analyze our models, providing insights for future work.
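For illustration only, the following is a minimal sketch (not the authors' code) of the pipeline the abstract describes: an image encoder maps an object image to an embedding in one forward pass, that embedding is injected into the text-conditioning sequence of a text-to-image model, and training combines a denoising loss with an identity-preservation regularizer. All module names, dimensions, and the exact form of the identity loss are assumptions made for this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ObjectEncoder(nn.Module):
    """Maps an object image to a single object-specific embedding (one forward pass)."""
    def __init__(self, embed_dim: int = 768):
        super().__init__()
        # Stand-in for a pretrained vision backbone; the real system would reuse
        # an existing image encoder rather than this toy CNN.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=4), nn.GELU(),
            nn.Conv2d(64, 256, 4, stride=4), nn.GELU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.proj = nn.Linear(256, embed_dim)  # project into the text-embedding space

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        return self.proj(self.backbone(image))  # shape: (batch, embed_dim)


def inject_object_token(text_embeds: torch.Tensor,
                        obj_embed: torch.Tensor,
                        slot: int) -> torch.Tensor:
    """Replace a placeholder token embedding in the caption with the object embedding."""
    out = text_embeds.clone()
    out[:, slot, :] = obj_embed
    return out


def joint_loss(noise_pred: torch.Tensor,
               noise_target: torch.Tensor,
               obj_embed: torch.Tensor,
               obj_embed_ref: torch.Tensor,
               lambda_id: float = 0.1) -> torch.Tensor:
    """Hypothetical regularized joint objective: denoising loss plus an
    identity-preservation term that keeps the learned object embedding close
    to a frozen reference embedding."""
    denoise = F.mse_loss(noise_pred, noise_target)
    identity = F.mse_loss(obj_embed, obj_embed_ref.detach())
    return denoise + lambda_id * identity
```

In this sketch, the object embedding simply overwrites one token position in the caption's embedding sequence; the paper's caption generation scheme and the concrete conditioning mechanism may differ.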