Recent image generation approaches often address subject-, style-, and structure-driven conditioning in isolation, leading to feature entanglement and limited task transferability. In this paper, we introduce 3SGen, a task-aware unified framework that performs all three conditioning modes within a single model. 3SGen employs an MLLM equipped with learnable semantic queries to align text-image semantics, complemented by a VAE branch that preserves fine-grained visual details. At its core, an Adaptive Task-specific Memory (ATM) module dynamically disentangles, stores, and retrieves condition-specific priors, such as identity for subjects, textures for styles, and spatial layouts for structures, via a lightweight gating mechanism over a set of scalable memory items. This design mitigates inter-task interference and naturally scales to compositional inputs. In addition, we propose 3SGen-Bench, a unified image-driven generation benchmark with standardized metrics for evaluating cross-task fidelity and controllability. Extensive experiments on 3SGen-Bench and other public benchmarks demonstrate superior performance across diverse image-driven generation tasks.
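To make the gated memory retrieval concrete, the following is a minimal PyTorch sketch of the general idea, not the paper's actual implementation: a bank of learnable memory items is scored by a lightweight gate conditioned on the input features, and the retrieved prior is injected back into the condition tokens. All names (`AdaptiveTaskMemory`, `num_items`, `gate`) and the pooling/injection choices are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AdaptiveTaskMemory(nn.Module):
    """Hypothetical sketch of an ATM-style module: a scalable bank of
    learnable memory items queried via a lightweight gate, so condition
    features retrieve task-specific priors (e.g., identity for subjects,
    texture for styles, layout for structures)."""

    def __init__(self, dim: int, num_items: int = 16):
        super().__init__()
        # Scalable bank of learnable memory items, shape (num_items, dim).
        self.memory = nn.Parameter(torch.randn(num_items, dim) * 0.02)
        # Lightweight gate: scores each memory item from pooled condition features.
        self.gate = nn.Linear(dim, num_items)

    def forward(self, cond_tokens: torch.Tensor) -> torch.Tensor:
        # cond_tokens: (batch, seq, dim), e.g., semantic queries from the
        # MLLM and/or fine-grained features from the VAE branch (assumed).
        pooled = cond_tokens.mean(dim=1)                 # (batch, dim)
        weights = self.gate(pooled).softmax(dim=-1)      # (batch, num_items)
        retrieved = weights @ self.memory                # (batch, dim)
        # Inject the retrieved task-specific prior into every condition token.
        return cond_tokens + retrieved.unsqueeze(1)
```

Under this reading, mitigating inter-task interference amounts to the gate routing different condition types to different memory items, and scaling to compositional inputs amounts to enlarging `num_items`.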