Text-to-image models have recently achieved remarkable success, producing seemingly accurate samples of photo-realistic quality. However, just as state-of-the-art language models still struggle to evaluate precise statements consistently, so do language-model-based image generation processes. In this work, we showcase the problems that state-of-the-art text-to-image models such as DALL-E have with generating accurate samples for statements from the DrawBench benchmark. Furthermore, we show that CLIP is unable to rerank those generated samples consistently. To this end, we propose LogicRank, a neuro-symbolic reasoning framework that yields a more accurate ranking system for such precision-demanding settings. LogicRank integrates smoothly into the generation process of text-to-image models and, moreover, can be used to fine-tune towards a more logically precise model.
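The reranking idea described above, preferring candidates that satisfy a symbolic check over raw CLIP similarity, can be sketched minimally as follows. Note that `Candidate`, `logic_rank`, and the scene-fact dictionary are illustrative assumptions for this sketch, not the paper's actual interface; the CLIP scores and detected facts are dummy values standing in for a real CLIP encoder and object detector.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Candidate:
    """A generated image summarized by a CLIP score and symbolic facts.

    Both fields are assumptions for this sketch: clip_score would come
    from a text-image similarity model, scene from e.g. a detector."""
    image_id: str
    clip_score: float   # cosine similarity between prompt and image
    scene: Dict         # symbolic facts extracted from the image

def logic_rank(candidates: List[Candidate],
               constraint: Callable[[Dict], bool]) -> List[Candidate]:
    """Rerank candidates: those satisfying the symbolic constraint come
    first; ties are broken by CLIP score, descending."""
    return sorted(candidates,
                  key=lambda c: (constraint(c.scene), c.clip_score),
                  reverse=True)

# Hypothetical DrawBench-style prompt: "a red cube on top of a blue sphere".
prompt_ok = lambda s: (s.get("cube_color") == "red"
                       and s.get("cube_above_sphere", False))

samples = [
    Candidate("img_a", 0.31, {"cube_color": "blue", "cube_above_sphere": True}),
    Candidate("img_b", 0.27, {"cube_color": "red",  "cube_above_sphere": True}),
    Candidate("img_c", 0.33, {"cube_color": "red",  "cube_above_sphere": False}),
]
ranked = logic_rank(samples, prompt_ok)
# img_b is ranked first despite its lower CLIP score, because it is the
# only sample whose symbolic scene satisfies the prompt's constraint.
```

A pure CLIP reranking would have placed `img_c` first; the symbolic check captures the precision-demanding part of the statement that the similarity score misses.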