3SGen: Unified Subject, Style, and Structure-Driven Image Generation with Adaptive Task-specific Memory

Recent image generation approaches often address subject-, style-, and structure-driven conditioning in isolation, which leads to feature entanglement and limited task transferability. In this paper, we introduce 3SGen, a task-aware unified framework that performs all three conditioning modes within a single model. 3SGen employs an MLLM equipped with learnable semantic queries to align text-image semantics, complemented by a VAE branch that preserves fine-grained visual details. At its core, an Adaptive Task-specific Memory (ATM) module dynamically disentangles, stores, and retrieves condition-specific priors, such as identity for subjects, textures for styles, and spatial layouts for structures, via a lightweight gating mechanism and a scalable set of memory items. This design mitigates inter-task interference and naturally scales to compositional inputs. In addition, we propose 3SGen-Bench, a unified image-driven generation benchmark with standardized metrics for evaluating cross-task fidelity and controllability. Extensive experiments on 3SGen-Bench and other public benchmarks demonstrate the superior performance of 3SGen across diverse image-driven generation tasks.
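To make the gating-plus-memory idea concrete, the following is a minimal sketch of how a gated, task-specific memory readout could work. It is not the authors' implementation: the abstract does not specify the ATM architecture, so the class name `GatedTaskMemory`, the per-task gate vectors, the softmax gate, the dot-product attention over memory items, and all dimensions are assumptions chosen for illustration. Weights are random here, whereas a real model would learn them.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


class GatedTaskMemory:
    """Toy gated memory with one bank of items per task (hypothetical layout)."""

    TASKS = ("subject", "style", "structure")

    def __init__(self, dim, items_per_task=4, seed=0):
        rng = random.Random(seed)
        self.dim = dim
        # Assumption: one gate vector per task, scored against the condition feature.
        self.gate = {t: [rng.gauss(0, 1) for _ in range(dim)] for t in self.TASKS}
        # Assumption: each task owns a small bank of memory items (random stand-ins
        # for what would be trained parameters).
        self.memory = {
            t: [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(items_per_task)]
            for t in self.TASKS
        }

    def gate_weights(self, feature):
        """Soft assignment of a condition feature to the three tasks."""
        logits = [dot(self.gate[t], feature) for t in self.TASKS]
        return dict(zip(self.TASKS, softmax(logits)))

    def retrieve(self, feature):
        """Return a gate-weighted mixture of attention readouts from each bank."""
        weights = self.gate_weights(feature)
        scale = math.sqrt(self.dim)
        prior = [0.0] * self.dim
        for task in self.TASKS:
            items = self.memory[task]
            # Scaled dot-product attention of the feature over this task's items.
            attn = softmax([dot(item, feature) / scale for item in items])
            for a, item in zip(attn, items):
                for i in range(self.dim):
                    prior[i] += weights[task] * a * item[i]
        return prior, weights


atm = GatedTaskMemory(dim=8)
feature = [0.1 * i for i in range(8)]  # stand-in for an encoded condition image
prior, weights = atm.retrieve(feature)
```

In this sketch the gate yields a soft distribution over the three tasks, so a compositional input (for example, a subject reference combined with a style reference) could draw on more than one bank at once. Whether 3SGen uses soft or hard gating is not stated in the abstract.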
