Text-to-image (T2I) generation has made remarkable progress, yet existing systems still lack intuitive control over spatial composition, object consistency, and multi-step editing. We present $\textbf{LayerCraft}$, a modular framework that uses large language models (LLMs) as autonomous agents to orchestrate structured, layered image generation and editing. LayerCraft supports two key capabilities: (1) $\textit{structured generation}$ from simple prompts via chain-of-thought (CoT) reasoning, enabling it to decompose scenes, reason about object placement, and guide composition in a controllable, interpretable manner; and (2) $\textit{layered object integration}$, allowing users to insert and customize objects -- such as characters or props -- across diverse images or scenes while preserving identity, context, and style. The system comprises a coordinator agent, the $\textbf{ChainArchitect}$ for CoT-driven layout planning, and the $\textbf{Object Integration Network (OIN)}$ for seamless image editing using off-the-shelf T2I models without retraining. Through applications like batch collage editing and narrative scene generation, LayerCraft empowers non-experts to iteratively design, customize, and refine visual content with minimal manual effort. Code will be released at https://github.com/PeterYYZhang/LayerCraft.