Recent diffusion models have advanced image editing by improving fidelity and controllability across creative and personalized applications. However, multi-object scenes remain challenging, as reliable control over object categories, counts, and spatial layout is difficult to achieve. For that, we first study quantity and layout consistent image editing, abbreviated as QL-Edit, which targets control of object quantity and spatial layout in multi-object scenes. Then, we present IMAGHarmony, a straightforward framework featuring a plug-and-play harmony aware (HA) module that fuses perception semantics while modeling object counts and locations, resulting in accurate edits and strong structural consistency. We further observe that diffusion models are sensitive to the choice of initial noise and tend to prefer certain noise patterns. Based on this finding, we present a preference-guided noise selection (PNS) strategy that selects semantically aligned initial noise through vision and language matching, thereby further improving generation stability and layout consistency in multiple object editing. To support evaluation, we develop HarmonyBench, a comprehensive benchmark that covers a diverse range of quantity and layout control scenarios. Extensive experiments demonstrate that IMAGHarmony outperforms prior methods in both structural alignment and semantic accuracy, utilizing only 200 training images and 10.6M of trainable parameters. Code, models, and data are available at https://github.com/muzishen/IMAGHarmony.
翻译:暂无翻译