Despite their unprecedented ability in imaginative creation, large text-to-image models are further expected to express customized concepts. Existing works generally learn such concepts in an optimization-based manner, which brings an excessive computation or memory burden. In this paper, we instead propose a learning-based encoder for fast and accurate concept customization, which consists of global and local mapping networks. Specifically, the global mapping network separately projects the hierarchical features of a given image into multiple ``new'' words in the textual word embedding space, i.e., one primary word for the well-editable concept and several auxiliary words to exclude irrelevant disturbances (e.g., background). Meanwhile, a local mapping network injects the encoded patch features into cross-attention layers to supply omitted details without sacrificing the editability of the primary concept. We compare our method with prior optimization-based approaches on a variety of user-defined concepts and demonstrate that it achieves higher-fidelity inversion and more robust editability with a significantly faster encoding process. Our code will be publicly available at https://github.com/csyxwei/ELITE.
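For intuition, the following minimal PyTorch sketch illustrates the two mapping networks described above. The module names (`GlobalMapping`, `LocalMapping`), the per-level MLP design, and all dimensions are illustrative assumptions, not the authors' exact architecture; see the repository above for the actual implementation.

```python
# A minimal sketch, assuming a CLIP-style image encoder whose intermediate
# hidden states provide the hierarchical features. Names and shapes here
# are hypothetical, chosen only to illustrate the abstract's description.
import torch
import torch.nn as nn

class GlobalMapping(nn.Module):
    """Projects L hierarchical image features into L word embeddings:
    one primary word plus (L - 1) auxiliary words."""
    def __init__(self, num_layers: int, feat_dim: int, word_dim: int):
        super().__init__()
        # One independent MLP per feature level.
        self.mlps = nn.ModuleList(
            nn.Sequential(nn.Linear(feat_dim, word_dim),
                          nn.LeakyReLU(),
                          nn.Linear(word_dim, word_dim))
            for _ in range(num_layers)
        )

    def forward(self, feats):
        # feats: list of num_layers tensors, each (B, feat_dim).
        # Returns (B, num_layers, word_dim); index 0 is the primary word.
        return torch.stack([m(f) for m, f in zip(self.mlps, feats)], dim=1)

class LocalMapping(nn.Module):
    """Maps encoded patch features into the key/value context space of a
    cross-attention layer, so omitted details can be injected alongside
    the text conditioning."""
    def __init__(self, feat_dim: int, ctx_dim: int):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(feat_dim, ctx_dim),
                                  nn.LeakyReLU(),
                                  nn.Linear(ctx_dim, ctx_dim))

    def forward(self, patch_feats):
        # patch_feats: (B, N_patches, feat_dim) -> (B, N_patches, ctx_dim)
        return self.proj(patch_feats)
```

In this reading, the primary word embedding replaces a placeholder token in the prompt for editable generation, while the auxiliary words absorb backgrounds and other disturbances during training and are discarded at inference; the local features are fed to cross-attention only to restore detail.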