Diffusion models have established the state of the art in text-to-image generation, but their performance often relies on a diffusion prior network that translates text embeddings into the visual manifold for easier decoding. These priors are computationally expensive and require extensive training on massive datasets. In this work, we challenge the necessity of a trained prior by employing Optimization-based Visual Inversion (OVI), a training-free and data-free alternative. OVI initializes a latent visual representation from random pseudo-tokens and iteratively optimizes it to maximize its cosine similarity with the input text prompt embedding. We further propose two novel constraints, a Mahalanobis-based loss and a Nearest-Neighbor loss, to regularize the OVI optimization toward the distribution of realistic images. Our experiments, conducted on Kandinsky 2.2, show that OVI can serve as an alternative to traditional priors. More importantly, our analysis reveals a critical flaw in current evaluation benchmarks such as T2I-CompBench++: simply using the text embedding as a prior achieves surprisingly high scores despite lower perceptual quality. Our constrained OVI methods improve visual fidelity over this baseline, with the Nearest-Neighbor approach proving particularly effective, achieving quantitative scores comparable to or higher than those of the state-of-the-art data-efficient prior, indicating that the idea merits further investigation. The code will be made publicly available upon acceptance.
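For concreteness, the sketch below illustrates the OVI loop summarized above: a randomly initialized latent is optimized to maximize cosine similarity with the text embedding, regularized by a Mahalanobis term and a nearest-neighbor term toward real image embeddings. It is a minimal PyTorch sketch assuming a CLIP-like shared text/image embedding space and precomputed real-image statistics; all function names, shapes, and hyperparameters (`ovi_invert`, `lam_maha`, `lam_nn`, the embedding bank) are illustrative assumptions, not the authors' implementation.

```python
# Minimal OVI sketch (illustrative, not the authors' code).
import torch
import torch.nn.functional as F


def mahalanobis_loss(z, mean, cov_inv):
    """Squared Mahalanobis distance of z from a Gaussian fit of real image embeddings."""
    diff = z - mean
    return diff @ cov_inv @ diff


def nearest_neighbor_loss(z, bank):
    """Distance from z to its nearest neighbor in a bank of real image embeddings."""
    dists = torch.cdist(z.unsqueeze(0), bank)  # shape (1, N)
    return dists.min()


def ovi_invert(text_emb, mean, cov_inv, bank,
               steps=500, lr=1e-2, lam_maha=0.1, lam_nn=0.1):
    """Optimize a random pseudo-token latent to align with the text prompt embedding."""
    z = torch.randn_like(text_emb, requires_grad=True)  # random initialization
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        # Maximize cosine similarity with the prompt embedding.
        loss = 1.0 - F.cosine_similarity(z, text_emb, dim=-1)
        # Constrain the latent toward the distribution of real image embeddings.
        loss = loss + lam_maha * mahalanobis_loss(z, mean, cov_inv)
        loss = loss + lam_nn * nearest_neighbor_loss(z, bank)
        loss.backward()
        opt.step()
    return z.detach()  # used in place of the diffusion prior's output
```

In this reading, the optimized latent simply replaces the image embedding a trained prior would produce before it is passed to the decoder (e.g. the Kandinsky 2.2 decoder).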