Recent text-to-image diffusion models have achieved remarkable visual fidelity but often struggle with semantic alignment to complex prompts. We introduce CritiFusion, a novel inference-time framework that integrates a multimodal semantic critique mechanism with frequency-domain refinement to improve text-to-image consistency and detail. The proposed CritiCore module leverages a vision-language model and multiple large language models to enrich the prompt context and produce high-level semantic feedback, guiding the diffusion process to better align generated content with the prompt's intent. Additionally, SpecFusion merges intermediate generation states in the spectral domain, injecting coarse structural information while preserving high-frequency details. Requiring no additional model training, CritiFusion serves as a plug-in refinement stage compatible with existing diffusion backbones. Experiments on standard benchmarks show that our method notably improves human-aligned metrics of text-image correspondence and visual quality. CritiFusion consistently boosts performance on human preference scores and aesthetic evaluations, achieving results on par with state-of-the-art reward optimization approaches. Qualitative results further demonstrate superior detail, realism, and prompt fidelity, indicating the effectiveness of our semantic critique and spectral alignment strategy.
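The spectral-merging idea behind SpecFusion can be illustrated with a minimal sketch: take low-frequency components (coarse structure) from one intermediate state and high-frequency components (fine detail) from another via a 2D Fourier transform. This is a hypothetical toy example, not the paper's implementation; the function name `spectral_merge`, the radial low-pass mask, and the `cutoff` parameter are assumptions made for illustration.

```python
import numpy as np

def spectral_merge(coarse, detailed, cutoff=0.15):
    """Illustrative frequency-domain blend (NOT the paper's exact method):
    keep low frequencies (coarse structure) from `coarse` and
    high frequencies (fine detail) from `detailed`.
    `cutoff` is the normalized radius of the assumed low-pass mask."""
    h, w = coarse.shape
    # Centered 2D spectra of both intermediate states
    F_c = np.fft.fftshift(np.fft.fft2(coarse))
    F_d = np.fft.fftshift(np.fft.fft2(detailed))
    # Radial low-pass mask in normalized frequency coordinates
    yy, xx = np.mgrid[0:h, 0:w]
    r = np.sqrt(((yy - h / 2) / h) ** 2 + ((xx - w / 2) / w) ** 2)
    low = r <= cutoff
    # Low band from the coarse state, high band from the detailed one
    merged = np.where(low, F_c, F_d)
    # Back to the spatial domain; imaginary residue is numerical noise
    return np.real(np.fft.ifft2(np.fft.ifftshift(merged)))
```

In this sketch, a larger `cutoff` injects more coarse structure from the first state, while a smaller one preserves more of the second state's high-frequency detail; the actual SpecFusion merging rule may differ.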