Recent large-scale generative models have attained unprecedented performance, especially in producing high-fidelity images from text prompts. Textual inversion (TI), used alongside text-to-image model backbones, has been proposed as an effective technique for personalizing generation when prompts contain user-defined, unseen, or long-tail concept tokens. Despite this, we find and show that deploying TI remains full of "dark magic": to name a few issues, a harsh requirement for additional datasets, arduous human effort in the loop, and a lack of robustness. In this work, we propose a much-enhanced version of TI, dubbed Controllable Textual Inversion (COTI), that resolves all of the aforementioned problems and in turn delivers a robust, data-efficient, and easy-to-use framework. At the core of COTI is a theoretically guided loss objective instantiated with a comprehensive and novel weighted scoring mechanism, encapsulated in an active-learning paradigm. Extensive results show that COTI significantly outperforms prior TI-related approaches, with a 26.05 decrease in FID score and a 23.00% boost in R-precision.