Despite the impressive results of arbitrary image-guided style transfer methods, text-driven image stylization has recently been proposed to transfer a natural image into a stylized one according to textual descriptions of the target style provided by the user. Unlike prior image-to-image transfer approaches, the text-guided stylization process provides users with a more precise and intuitive way to express the desired style. However, the large discrepancy between cross-modal inputs and outputs makes it challenging to conduct text-driven image stylization in a typical feed-forward CNN pipeline. In this paper, we present DiffStyler, which builds on diffusion models: cross-modal style information can be easily integrated as guidance at each step of the diffusion process. In particular, we use a dual diffusion processing architecture to control the balance between the content and style of the diffused results. Furthermore, we propose a learnable noise derived from the content image on which the reverse denoising process is based, enabling the stylization results to better preserve the structure of the content image. Extensive qualitative and quantitative experiments show that the proposed DiffStyler outperforms baseline methods.
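As a rough illustration of the two ideas named above, the following is a minimal sketch, not the authors' released implementation: it assumes two hypothetical denoisers (`content_denoiser`, `style_denoiser`) whose noise predictions are blended in a dual diffusion step, and it initializes a learnable starting noise from the content image for the reverse denoising process.

```python
# Hedged sketch of a dual diffusion step and a content-based learnable noise.
# All module names here are illustrative stand-ins, not the paper's code.
import torch
import torch.nn as nn


class DualDiffusionStep(nn.Module):
    """Blend two denoisers' noise predictions to balance content and style."""

    def __init__(self, content_denoiser: nn.Module, style_denoiser: nn.Module,
                 content_weight: float = 0.5):
        super().__init__()
        self.content_denoiser = content_denoiser  # branch that preserves structure
        self.style_denoiser = style_denoiser      # branch guided by the text prompt
        self.w = content_weight                   # content/style trade-off

    def forward(self, x_t: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        eps_content = self.content_denoiser(x_t, t)
        eps_style = self.style_denoiser(x_t, t)
        # Weighted combination of the two noise estimates at this timestep.
        return self.w * eps_content + (1.0 - self.w) * eps_style


def learnable_noise_from_content(content: torch.Tensor) -> nn.Parameter:
    """Initialize the reverse process from a learnable noise tied to the content image.

    Making the noise a parameter allows it to be optimized so that the stylized
    output stays close to the content layout (an assumption about the usage,
    following the abstract rather than released code).
    """
    noise = content + 0.1 * torch.randn_like(content)
    return nn.Parameter(noise)
```

The blending weight plays the role of the content-style balance described in the abstract; in practice such a weight could also vary over timesteps rather than stay fixed.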