Controllable image synthesis with user scribbles has attracted widespread public interest with the recent advent of text-conditioned latent diffusion models. The user scribbles control the color composition, while the text prompt controls the overall image semantics. However, we note that prior works in this direction suffer from an intrinsic domain shift problem, wherein the generated outputs often lack detail and resemble simplistic representations of the target domain. In this paper, we propose a novel guided image synthesis framework, which addresses this problem by modeling the output image as the solution of a constrained optimization problem. We show that while computing an exact solution to this optimization is infeasible, an approximation can be obtained that requires only a single pass of the reverse diffusion process. Additionally, we show that by simply defining a cross-attention-based correspondence between the input text tokens and the user stroke-painting, the user can also control the semantics of different painted regions without requiring any conditional training or finetuning. Human user study results show that the proposed approach outperforms the previous state of the art by over 85.32% on overall user satisfaction scores. The project page for our paper is available at https://1jsingh.github.io/gradop.
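To make the single-pass approximation idea concrete, the following is a minimal PyTorch sketch, not the exact algorithm from the paper: the denoiser, painting_latent (an encoding of the user stroke-painting), the alphas_cumprod noise schedule, and guidance_weight are all hypothetical stand-ins, and the per-step gradient nudge is modeled on classifier-guidance-style sampling under a DDIM-like update.

    # Sketch (assumption): approximate the constrained optimization by taking a
    # gradient step toward the user stroke-painting at every denoising step of a
    # single reverse diffusion pass. `denoiser`, `painting_latent`,
    # `alphas_cumprod` (1-D tensor), and `guidance_weight` are hypothetical.
    import torch

    def guided_reverse_pass(denoiser, x_T, painting_latent, alphas_cumprod,
                            guidance_weight=0.1):
        """One reverse diffusion pass with per-step guidance toward the painting."""
        x = x_T
        T = len(alphas_cumprod)
        for t in reversed(range(T)):
            a_t = alphas_cumprod[t]
            with torch.enable_grad():
                x = x.detach().requires_grad_(True)
                eps = denoiser(x, t)                                  # predicted noise
                x0_hat = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()    # predicted clean latent
                # guidance loss: keep the predicted output close to the user painting
                loss = ((x0_hat - painting_latent) ** 2).mean()
                grad = torch.autograd.grad(loss, x)[0]
            # DDIM-style step to the previous timestep, followed by a guidance step
            a_prev = alphas_cumprod[t - 1] if t > 0 else torch.tensor(1.0)
            x = a_prev.sqrt() * x0_hat + (1 - a_prev).sqrt() * eps
            x = x - guidance_weight * grad
        return x

Because the guidance is applied only during sampling, this kind of sketch requires no conditional training or finetuning of the underlying diffusion model; the trade-off between fidelity to the stroke-painting and realism is set by the (assumed) guidance_weight.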