Multi-subject customization aims to synthesize multiple user-specified subjects into a coherent image. To address issues such as missing subjects or subject conflicts, recent works incorporate layout guidance to provide explicit spatial constraints. However, existing methods still struggle to balance three critical objectives: text alignment, subject identity preservation, and layout control, while their reliance on additional training further limits scalability and efficiency. In this paper, we present AnyMS, a novel training-free framework for layout-guided multi-subject customization. AnyMS takes three input conditions: a text prompt, subject images, and layout constraints, and introduces a bottom-up dual-level attention decoupling mechanism to harmonize their integration during generation. Specifically, global decoupling separates the cross-attention between textual and visual conditions to ensure text alignment, while local decoupling confines each subject's attention to its designated region, preventing subject conflicts and thereby guaranteeing identity preservation and layout control. Moreover, AnyMS employs pre-trained image adapters to extract subject-specific features aligned with the diffusion model, removing the need for subject learning or adapter tuning. Extensive experiments demonstrate that AnyMS achieves state-of-the-art performance, supporting complex compositions and scaling to larger numbers of subjects.
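To make the dual-level decoupling concrete, the mechanism can be pictured as masked cross-attention in a diffusion U-Net layer: the text branch attends over all spatial tokens (global decoupling), while each subject's image-feature branch is injected only inside its layout region (local decoupling). The following is a minimal PyTorch sketch under these assumptions; the function name `decoupled_cross_attention`, the tensor layout, and the additive injection are illustrative choices, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def decoupled_cross_attention(q, text_kv, subj_kvs, subj_masks, scale):
    """
    q:          (B, N, d)  image-latent queries over N spatial tokens
    text_kv:    (key, value) pair, each (B, Lt, d), from the text prompt
    subj_kvs:   list of (key, value) pairs, each (B, Ls, d), one per subject image
    subj_masks: list of (B, N) binary masks, 1 inside each subject's layout box
    """
    # Global decoupling: text cross-attention runs over all spatial tokens,
    # separately from the subject branches, so text alignment is preserved.
    tk, tv = text_kv
    text_attn = F.softmax(q @ tk.transpose(-1, -2) * scale, dim=-1)
    out = text_attn @ tv                                   # (B, N, d)

    # Local decoupling: each subject's cross-attention output is confined to
    # its own layout region, so subjects do not compete for the same tokens.
    for (sk, sv), mask in zip(subj_kvs, subj_masks):
        subj_attn = F.softmax(q @ sk.transpose(-1, -2) * scale, dim=-1)
        subj_out = subj_attn @ sv                          # (B, N, d)
        out = out + mask.unsqueeze(-1) * subj_out          # inject inside the box only

    return out
```

The additive, mask-gated injection mirrors how adapter-style image conditioning is commonly added alongside text cross-attention; here the per-subject masks are what enforce layout control and keep subject identities from bleeding into each other's regions.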