We propose a new framework for conditional image synthesis from semantic layouts at any level of precision, ranging from pure text to a 2D semantic canvas with precise shapes. More specifically, the input layout consists of one or more semantic regions with free-form text descriptions and adjustable precision levels, which can be set according to the desired controllability. The framework naturally reduces to text-to-image (T2I) at the lowest level, where no shape information is given, and becomes segmentation-to-image (S2I) at the highest level. By supporting the levels in between, our framework is flexible in assisting users of different drawing expertise and at different stages of their creative workflow. We introduce several novel techniques to address the challenges arising from this new setup, including a pipeline for collecting training data; a precision-encoded mask pyramid and a text feature map representation that jointly encode precision level, semantics, and composition; and a multi-scale guided diffusion model to synthesize images. To evaluate the proposed method, we collect a test dataset containing user-drawn layouts with diverse scenes and styles. Experimental results show that the proposed method can generate high-quality images following the layout at the given precision, and compares favorably against existing methods. Project page: \url{https://zengxianyu.github.io/scenec/}
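To make the representation concrete, below is a minimal NumPy sketch of how one might jointly encode precision level, semantics, and composition into a pyramid of text feature maps. The names (\texttt{Region}, \texttt{coarsen}, \texttt{precision\_encoded\_pyramid}) and the specific encoding scheme, where a region's shape is used only at pyramid levels within its precision budget and otherwise falls back to whole-canvas text conditioning, are illustrative assumptions, not the paper's actual implementation.

\begin{verbatim}
from dataclasses import dataclass

import numpy as np


@dataclass
class Region:
    """One semantic region of the input layout (hypothetical structure)."""
    text_embedding: np.ndarray  # (D,) embedding of the free-form description
    mask: np.ndarray            # (S, S) binary mask giving the region's shape
    precision: int              # 0 = text only (T2I) ... L = exact shape (S2I)


def coarsen(mask, factor):
    """Max-pool a binary mask by an integer factor, coarsening its shape."""
    h, w = mask.shape
    return mask.reshape(h // factor, factor, w // factor, factor).max(axis=(1, 3))


def precision_encoded_pyramid(regions, num_levels, size):
    """Build one text feature map per pyramid level (coarse to fine).

    A region's shape is used only at levels within its precision budget;
    beyond that, its embedding conditions the whole canvas, so a
    precision-0 layout degenerates to plain text-to-image.
    """
    d = regions[0].text_embedding.shape[0]
    feature_maps = []
    for level in range(num_levels):
        res = size >> (num_levels - 1 - level)  # resolution doubles per level
        fmap = np.zeros((res, res, d))
        for r in regions:
            if level < r.precision:
                inside = coarsen(r.mask, size // res).astype(bool)
            else:
                inside = np.ones((res, res), dtype=bool)  # shape info exhausted
            fmap[inside] += r.text_embedding
        feature_maps.append(fmap)
    return feature_maps


# Example: a 64x64 layout with one text-only region and one precise region.
sky = Region(np.random.randn(8), np.zeros((64, 64)), precision=0)
tree = Region(np.random.randn(8), np.eye(64), precision=3)
pyramid = precision_encoded_pyramid([sky, tree], num_levels=3, size=64)
print([f.shape for f in pyramid])  # [(16, 16, 8), (32, 32, 8), (64, 64, 8)]
\end{verbatim}

In this hypothetical scheme, a precision-0 region yields a spatially uniform feature map (pure text conditioning), while raising a region's precision injects its shape at progressively finer scales, matching the T2I-to-S2I spectrum described above.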