Personalized models have demonstrated remarkable success in understanding and generating concepts provided by users. However, existing methods use separate concept tokens for understanding and generation, treating these tasks in isolation. This can limit the generation of images from complex prompts: for example, given the concept $\langle bo\rangle$, such methods struggle to generate "$\langle bo\rangle$ wearing its hat" without additional textual descriptions of the hat. We call this kind of generation \textit{\textbf{personalized attribute-reasoning generation}}. To address this limitation, we present UniCTokens, a novel framework that effectively integrates personalized information into a unified vision language model (VLM) for both understanding and generation. UniCTokens trains a set of unified concept tokens that leverage complementary semantics, boosting both personalized tasks. Moreover, we propose a progressive training strategy with three stages: understanding warm-up, bootstrapping generation from understanding, and deepening understanding from generation, which enhances the mutual benefits between the two tasks. To quantitatively evaluate unified VLM personalization, we present UnifyBench, the first benchmark for assessing concept understanding, concept generation, and attribute-reasoning generation. Experimental results on UnifyBench indicate that UniCTokens achieves competitive performance compared to leading methods in concept understanding and concept generation, and state-of-the-art results in personalized attribute-reasoning generation. Our research demonstrates that enhanced understanding improves generation, and that the generation process can in turn yield valuable insights for understanding. Our code and dataset will be released at: \href{https://github.com/arctanxarc/UniCTokens}{https://github.com/arctanxarc/UniCTokens}.