While state-of-the-art image generation models achieve remarkable visual quality, their internal generative processes remain a "black box." This opacity limits human observation and intervention, and poses a barrier to ensuring model reliability, safety, and control. Moreover, their non-human-like workflows make their behavior difficult for human observers to interpret. To address this, we introduce the Chain-of-Image Generation (CoIG) framework, which reframes image generation as a sequential, semantic process analogous to how humans create art. Just as Chain-of-Thought (CoT) prompting improved the monitorability and performance of large language models (LLMs), CoIG yields analogous benefits for text-to-image generation. CoIG uses an LLM to decompose a complex prompt into a sequence of simple, step-by-step instructions, which the image generation model then executes by progressively generating and editing the image. Because each step focuses on a single semantic entity, the process can be monitored directly. We formally assess this property using two novel metrics: CoIG Readability, which evaluates the clarity of each intermediate step via its corresponding output, and Causal Relevance, which quantifies the impact of each procedural step on the final generated image. We further show that our framework mitigates entity collapse by decomposing the complex generation task into simple subproblems, analogous to the procedural reasoning employed by CoT. Our experiments indicate that CoIG substantially enhances quantitative monitorability while achieving compositional robustness competitive with established baseline models. The framework is model-agnostic and can be integrated with any image generation model.
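The decompose-then-execute loop described above can be sketched schematically. This is a minimal illustration, not the paper's implementation: `plan_steps` and `apply_step` are hypothetical stand-ins for the LLM planner and the image editing model, and the "image" is just a list recording applied instructions.

```python
def plan_steps(prompt):
    """Hypothetical stand-in for the LLM planner: split a compound
    prompt into one simple instruction per semantic entity
    (here, naively, one per comma-separated phrase)."""
    return [f"add {entity.strip()}" for entity in prompt.split(",")]

def apply_step(image, instruction):
    """Hypothetical stand-in for the image editor: each call edits
    the previous image (here, appends the instruction)."""
    return image + [instruction]

def coig_generate(prompt):
    """Run the chain: execute the plan step by step, keeping every
    intermediate image so each step can be monitored directly."""
    image, trace = [], []
    for instruction in plan_steps(prompt):
        image = apply_step(image, instruction)
        trace.append((instruction, list(image)))  # monitorable record
    return image, trace

final, trace = coig_generate("a red apple, a glass bottle, a wooden table")
```

Each entry of `trace` pairs one instruction with its intermediate output, which is what the CoIG Readability and Causal Relevance metrics would be computed over in a real pipeline.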