Recent works on personalized text-to-image generation usually learn to bind a special token to the specific subjects or styles of a few given images by tuning its embedding through gradient descent. It is natural to ask whether we can optimize the textual inversion by accessing only the model's inference process: determining the textual inversion with forward computation alone brings the benefits of lower GPU memory usage, simpler deployment, and secure access to scalable models. In this paper, we introduce a \emph{gradient-free} framework that optimizes the continuous textual inversion with an iterative evolutionary strategy. Specifically, we first initialize an appropriate token embedding for textual inversion by taking both visual and textual vocabulary information into account. Then, we decompose the evolutionary optimization into a dimensionality reduction of the search space and a non-convex, gradient-free optimization within the resulting subspace, which significantly accelerates the optimization process with negligible performance loss. Experiments on several applications demonstrate that a text-to-image model equipped with our gradient-free method performs comparably to its gradient-based counterparts, while supporting various GPU/CPU platforms, flexible deployment, and greater computational efficiency.
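The core idea, a forward-only evolutionary search in a reduced subspace of the embedding, can be illustrated with a minimal sketch. Everything here is hypothetical scaffolding, not the paper's implementation: the random projection `A`, the toy black-box `score` (which stands in for scoring a frozen model's inference output), and the simple (mu, lambda) evolution strategy are illustrative choices, with dimensions chosen arbitrarily.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: tune a d-dimensional token embedding through a
# black-box score, but search only a k-dimensional random subspace.
d, k = 768, 16                                  # full / reduced dimension
A = rng.standard_normal((d, k)) / np.sqrt(k)    # fixed projection (dim. reduction)
z0 = rng.standard_normal(d)                     # initial token embedding
target = rng.standard_normal(d)                 # hidden optimum for the toy objective

def score(emb):
    # Black-box objective: negative distance to a hidden target.
    # In the real setting this would run the frozen model's forward pass
    # and score the generated image; no gradients are ever taken.
    return -np.sum((emb - target) ** 2)

# Simple (mu, lambda) evolution strategy over the subspace coordinates c,
# so each candidate embedding is z0 + A @ c.
mu, lam, sigma = 4, 16, 0.5
mean = np.zeros(k)
for _ in range(200):
    cand = mean + sigma * rng.standard_normal((lam, k))   # sample offspring
    fits = np.array([score(z0 + A @ c) for c in cand])    # forward-only scoring
    elite = cand[np.argsort(fits)[-mu:]]                  # keep the best mu
    mean = elite.mean(axis=0)                             # recombine
    sigma *= 0.98                                         # anneal step size

best = z0 + A @ mean
```

Only the k subspace coordinates are searched, so each generation needs lam forward evaluations and no backpropagation, which is what makes the approach viable when only inference access to the model is available.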