Text-to-image (T2I) generation has made remarkable progress, yet existing systems still lack intuitive control over spatial composition, object consistency, and multi-step editing. We present $\textbf{LayerCraft}$, a modular framework that uses large language models (LLMs) as autonomous agents to orchestrate structured, layered image generation and editing. LayerCraft supports two key capabilities: (1) $\textit{structured generation}$ from simple prompts via chain-of-thought (CoT) reasoning, enabling it to decompose scenes, reason about object placement, and guide composition in a controllable, interpretable manner; and (2) $\textit{layered object integration}$, allowing users to insert and customize objects -- such as characters or props -- across diverse images or scenes while preserving identity, context, and style. The system comprises a coordinator agent, the $\textbf{ChainArchitect}$ for CoT-driven layout planning, and the $\textbf{Object Integration Network (OIN)}$ for seamless image editing using off-the-shelf T2I models without retraining. Through applications like batch collage editing and narrative scene generation, LayerCraft empowers non-experts to iteratively design, customize, and refine visual content with minimal manual effort. Code will be released at https://github.com/PeterYYZhang/LayerCraft.