Have you ever imagined what a corgi-like coffee machine or a tiger-like rabbit would look like? In this work, we attempt to answer these questions by exploring a new task called semantic mixing, which aims to blend two different semantics to create a new concept (e.g., corgi + coffee machine -> corgi-like coffee machine). Unlike style transfer, where an image is stylized according to a reference style without changing the image content, semantic mixing blends two different concepts in a semantic manner to synthesize a novel concept while preserving the spatial layout and geometry. To this end, we present MagicMix, a simple yet effective solution based on pre-trained text-conditioned diffusion models. Our method is motivated by the progressive generation property of diffusion models, in which layout and shape emerge at early denoising steps while semantically meaningful details appear at later steps. It first obtains a coarse layout (either by corrupting an image or by denoising from pure Gaussian noise given a text prompt), and then injects a conditional prompt for semantic mixing. Our method does not require any spatial mask or re-training, yet is able to synthesize novel objects with high fidelity. To improve the mixing quality, we further devise two simple strategies that provide better control and flexibility over the synthesized content. We present results on diverse downstream applications, including semantic style transfer, novel object synthesis, breed mixing, and concept removal, demonstrating the flexibility of our method. More results can be found on the project page: https://magicmix.github.io
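The two-stage idea described above (obtain a coarse layout by corrupting an image, then denoise under a conditional prompt) can be illustrated with a minimal sketch. This is not the authors' implementation: the linear noise schedule, the starting step `t_start`, and the `denoise_step` callable (a stand-in for one reverse step of a pre-trained text-conditioned diffusion model) are all assumptions introduced here for illustration. The sketch also omits the two additional mixing strategies the abstract mentions.

```python
import numpy as np


def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal level for a linear DDPM-style schedule (assumed here)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)


def corrupt_to_layout(image, t, alpha_bar, rng):
    """Forward-noise a clean image to step t.

    At an intermediate t the coarse layout survives while fine detail is
    drowned in noise, which is the property the abstract relies on.
    """
    noise = rng.standard_normal(image.shape)
    return np.sqrt(alpha_bar[t]) * image + np.sqrt(1.0 - alpha_bar[t]) * noise


def semantic_mix(layout_image, prompt, denoise_step, t_start, alpha_bar, rng):
    """Noise the layout image to t_start, then denoise down to step 0 under `prompt`.

    `denoise_step(x, t, prompt)` is a hypothetical callable; in practice it
    would wrap one reverse step of a text-conditioned diffusion model.
    """
    x = corrupt_to_layout(layout_image, t_start, alpha_bar, rng)
    for t in range(t_start, -1, -1):
        x = denoise_step(x, t, prompt)
    return x
```

With a real model, `layout_image` would be the image supplying the layout (for example, a photo of a coffee machine) and `prompt` the concept to mix in (for example, "corgi"). A larger `t_start` discards more of the original content and gives the prompt more influence; a smaller one preserves more of the layout image.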