Recent large-scale generative models trained on big data are capable of synthesizing incredible images yet suffer from limited controllability. This work offers a new generation paradigm that allows flexible control of the output image, such as its spatial layout and palette, while maintaining the synthesis quality and model creativity. With compositionality as the core idea, we first decompose an image into representative factors, and then train a diffusion model with all these factors as conditions to recompose the input. At the inference stage, the rich intermediate representations work as composable elements, leading to a huge design space (i.e., exponentially proportional to the number of decomposed factors) for customizable content creation. Notably, our approach, which we call Composer, supports conditions at various levels, such as text descriptions as global information, depth maps and sketches as local guidance, and color histograms for low-level details. Beyond improving controllability, we confirm that Composer serves as a general framework and facilitates a wide range of classical generative tasks without retraining. Code and models will be made available.
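To make the decompose-then-recompose idea concrete, the sketch below shows one plausible way a diffusion denoiser could accept multiple decomposed factors as optional conditions. The class name, factor encoders, embedding sizes, and fusion scheme are illustrative assumptions, not the authors' actual architecture; global factors (text embedding, color histogram) are fused into the timestep embedding, while local factors (depth, sketch) are concatenated with the noisy image.

```python
# Hypothetical sketch of multi-factor conditioning in the spirit of Composer.
# All module names and dimensions are assumptions for illustration only.
import torch
import torch.nn as nn

class FactorConditionedDenoiser(nn.Module):
    def __init__(self, img_channels=3, cond_dim=256):
        super().__init__()
        # Global conditions are projected to a shared embedding space and
        # added to the timestep embedding (assumed sizes: 768-d text, 64-bin histogram).
        self.text_proj = nn.Linear(768, cond_dim)
        self.palette_proj = nn.Linear(64, cond_dim)
        self.time_embed = nn.Sequential(
            nn.Linear(1, cond_dim), nn.SiLU(), nn.Linear(cond_dim, cond_dim))
        # Local conditions (depth map, sketch) are concatenated channel-wise
        # with the noisy image before the first convolution.
        local_channels = 1 + 1
        self.backbone = nn.Conv2d(img_channels + local_channels,
                                  img_channels, kernel_size=3, padding=1)

    def forward(self, x_t, t, text_emb=None, palette=None, depth=None, sketch=None):
        b, _, h, w = x_t.shape
        emb = self.time_embed(t.float().view(b, 1))
        # Each factor is optional; dropping a factor simply removes its term,
        # which is what makes the factors freely composable at inference time.
        if text_emb is not None:
            emb = emb + self.text_proj(text_emb)
        if palette is not None:
            emb = emb + self.palette_proj(palette)
        zeros = lambda c: torch.zeros(b, c, h, w, device=x_t.device)
        local = torch.cat([depth if depth is not None else zeros(1),
                           sketch if sketch is not None else zeros(1)], dim=1)
        h_in = torch.cat([x_t, local], dim=1)
        # A real model would inject `emb` into every block of a UNet; here it is
        # broadcast-added once only to keep the sketch short.
        return self.backbone(h_in) + emb.mean(dim=1).view(b, 1, 1, 1)
```

At inference, any subset of the conditions can be supplied (e.g., only a sketch plus a color histogram), and the remaining factors are left unconstrained, which is what yields the combinatorial design space described above.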