Diffusion models entangle content and style generation during the denoising process, leading to undesired content modification or insufficient style strength when directly applied to stylization tasks. Existing methods struggle to control the diffusion model effectively enough to meet the aesthetic-level requirements of stylization. In this paper, we introduce DiffArtist, the first approach that enables aesthetic-aligned control of content and style throughout the entire diffusion process, without additional training. Our key insight is to design disentangled representations for content and style in the noise space. By sharing features between the content and style representations, we enable fine-grained control of structural and appearance-level style strength without compromising visual appeal. We further propose Vision-Language Model (VLM)-based evaluation metrics for stylization, which align better with human preferences. Extensive experiments demonstrate that DiffArtist outperforms existing methods in alignment with human preferences and offers enhanced controllability. Project homepage: https://DiffusionArtist.github.io