The ability of intelligent agents to play games in a human-like fashion is popularly considered a benchmark of progress in Artificial Intelligence. Similarly, performance on multi-disciplinary tasks such as Visual Question Answering (VQA) is considered a marker for gauging progress in Computer Vision. In our work, we bring games and VQA together. Specifically, we introduce the first computational model aimed at Pictionary, the popular word-guessing social game. We first introduce Sketch-QA, an elementary version of the Visual Question Answering task. Styled after Pictionary, Sketch-QA uses incrementally accumulated sketch stroke sequences as visual data. Notably, Sketch-QA involves asking a fixed question ("What object is being drawn?") and gathering open-ended guess-words from human guessers. We analyze the resulting dataset and present our findings. To mimic Pictionary-style guessing, we subsequently propose a deep neural model which generates guess-words in response to temporally evolving human-drawn sketches. Our model even makes human-like mistakes while guessing, which strengthens its resemblance to human guessers. We evaluate our model on the large-scale guess-word dataset generated via the Sketch-QA task and compare it with various baselines. We also conduct a Visual Turing Test to obtain human impressions of the guess-words generated by humans and by our model. Experimental results demonstrate the promise of our approach for Pictionary and similarly themed games.
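The idea of "incrementally accumulated sketch stroke sequences" can be illustrated with a minimal sketch. The stroke representation (lists of 2D points), the function name `accumulate_strokes`, and the sample data are assumptions made for illustration only; they are not taken from the Sketch-QA dataset or the authors' code.

```python
from typing import Iterator, List, Tuple

# Assumed representation: a stroke is a list of (x, y) points.
Point = Tuple[float, float]
Stroke = List[Point]

# The fixed question posed to every guesser, as stated in the abstract.
QUESTION = "What object is being drawn?"


def accumulate_strokes(strokes: List[Stroke]) -> Iterator[List[Stroke]]:
    """Yield the partial sketch visible after each new stroke is added."""
    for t in range(1, len(strokes) + 1):
        yield strokes[:t]


# Hypothetical three-stroke sketch (a triangle), for illustration only.
sketch = [
    [(0.0, 0.0), (1.0, 0.0)],
    [(1.0, 0.0), (1.0, 1.0)],
    [(1.0, 1.0), (0.0, 0.0)],
]

# One partial sketch per time step; a guesser (human or model) would be
# shown each of these in turn and asked QUESTION.
partial_sketches = list(accumulate_strokes(sketch))
```

Each element of `partial_sketches` contains all strokes drawn so far, so a guess made at step `t` can be paired with exactly the visual evidence available at that point.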