Natural language often contains ambiguities that can lead to misinterpretation and miscommunication. While humans can handle ambiguities effectively by asking clarifying questions and/or relying on contextual cues and common-sense knowledge, resolving ambiguities can be notoriously hard for machines. In this work, we study ambiguities that arise in text-to-image generative models. We curate a benchmark dataset covering different types of ambiguities that occur in these systems. We then propose a framework to mitigate ambiguities in the prompts given to the systems by soliciting clarifications from the user. Through automatic and human evaluations, we show the effectiveness of our framework in generating more faithful images aligned with human intention in the presence of ambiguities.