Text-to-image diffusion models excel at generating high-quality images from natural language descriptions, but they often fail to preserve subject consistency across multiple outputs, limiting their use in visual storytelling. Most existing approaches rely on model fine-tuning or image conditioning, which are computationally expensive and require per-subject optimization. 1Prompt1Story, a training-free approach, concatenates all scene descriptions into a single prompt and rescales token embeddings; however, it suffers from semantic leakage, where embeddings of different frames become entangled and degrade text alignment. In this paper, we propose a simple yet effective training-free approach that addresses this semantic entanglement from a geometric perspective, refining the text embeddings to suppress unwanted semantics. Extensive experiments demonstrate that our approach significantly improves both subject consistency and text alignment over existing baselines.
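To make the geometric intuition concrete, below is a minimal sketch of one way such embedding refinement could look: each frame's prompt embedding is deflated along the directions of the other frames' embeddings, suppressing cross-frame semantics. Everything here is an illustrative assumption, including the function name `refine_embedding` and the pooled-vector representation; it is a sketch of the general idea, not the paper's actual algorithm.

```python
import torch

def refine_embedding(frame_emb: torch.Tensor,
                     other_embs: list[torch.Tensor]) -> torch.Tensor:
    """Suppress unwanted semantics in one frame's prompt embedding by
    deflating its components along the other frames' directions.
    Illustrative only: operates on pooled (dim,)-shaped vectors."""
    refined = frame_emb.clone()
    for other in other_embs:
        direction = other / other.norm()          # unit vector of an unwanted semantic
        # subtract the component of `refined` that lies along this direction
        refined = refined - (refined @ direction) * direction
    return refined

# Toy usage: refine the first of three 768-dim frame embeddings.
torch.manual_seed(0)
embs = [torch.randn(768) for _ in range(3)]
refined = refine_embedding(embs[0], embs[1:])
```

Note that sequential deflation against non-orthogonal directions is only an approximation of a joint orthogonal projection; an exact variant would first orthonormalize the unwanted directions (e.g., with `torch.linalg.qr`) before subtracting.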