Given a command, a human can directly execute the action after thinking, or choose to reject it while providing reasonable feedback. In contrast, the behavior of existing text-to-image generation methods is uncontrollable and irresponsible. In this paper, we conduct extensive experiments to verify whether these models can be accountable (say no and explain why) when given prohibited instructions. To this end, we define a novel text-based visual re-creation task and construct a new synthetic CLEVR-NOT dataset (620K) and a manually photographed Fruit-NOT dataset (50K). In our method, a text-image pair is fed into the machine as a query, and the model gives a yes or no answer after visual and textual reasoning. If the answer is yes, an image auto-encoder and an auto-regressive transformer must complete the visual re-creation while preserving image quality; otherwise, the system must explain why the command cannot be completed or is prohibited. We provide a detailed analysis of the experimental results in terms of image quality, answer accuracy, and model behavior in the face of uncertainty and imperfect user queries. Our results demonstrate the difficulty of performing both textual and visual reasoning within a single model. We hope our explorations and findings bring valuable insights into the accountability of text-based image generation models. Code and datasets are available at https://matrix-alpha.github.io.
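For concreteness, the decision-then-act loop described above can be sketched as follows. This is a minimal illustration only: the module names (`reasoner`, `autoencoder`, `transformer`) and their interfaces are assumptions for exposition, not the paper's actual API.

```python
# A hedged sketch of the accountable re-creation pipeline: answer yes/no
# for a (image, command) query, then either re-create or explain a refusal.
# All component interfaces below are hypothetical placeholders.

def accountable_recreate(image, command, reasoner, autoencoder, transformer):
    # Joint visual and textual reasoning decides whether the command is
    # feasible and permitted; if not, an explanation is produced.
    feasible, explanation = reasoner(image, command)
    if not feasible:
        # Prohibited or impossible instruction: say no and explain why.
        return {"answer": "no", "explanation": explanation}
    # Feasible instruction: encode the image into discrete tokens, let the
    # auto-regressive transformer predict the edited token sequence
    # conditioned on the command, then decode the tokens back to pixels.
    tokens = autoencoder.encode(image)
    edited_tokens = transformer.generate(tokens, command)
    new_image = autoencoder.decode(edited_tokens)
    return {"answer": "yes", "image": new_image}
```

The key design point is that refusal is a first-class output: the model commits to a yes/no decision before any generation, rather than attempting re-creation unconditionally.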