While vision-and-language models perform well on tasks such as visual question answering, they struggle with basic human commonsense reasoning. In this work, we introduce WinoGAViL: an online game for collecting vision-and-language associations (e.g., werewolves to a full moon), used as a dynamic benchmark to evaluate state-of-the-art models. Inspired by the popular card game Codenames, a spymaster gives a textual cue related to several visual candidates, and another player has to identify them. Human players are rewarded for creating associations that are challenging for a rival AI model but still solvable by other human players. We use the game to collect 3.5K instances, finding that they are intuitive for humans (>90% Jaccard index) but challenging for state-of-the-art AI models: the best model (ViLT) achieves a score of 52%, succeeding mostly where the cue is visually salient. Our analysis, as well as the feedback we collect from players, indicates that the collected associations require diverse reasoning skills, including general knowledge, common sense, abstraction, and more. We release the dataset, the code, and the interactive game, aiming to allow future data collection that can be used to develop models with better association abilities.
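To make the reported metric concrete, the following is a minimal sketch of the Jaccard index applied to one game instance. It assumes the score compares the set of candidates a solver selects against the set the spymaster associated with the cue; the function name `jaccard_index`, the candidate labels, and the example instance are hypothetical and are not taken from the released code.

```python
def jaccard_index(selected, gold):
    """Return |A ∩ B| / |A ∪ B| for two collections of candidate IDs."""
    selected, gold = set(selected), set(gold)
    union = selected | gold
    if not union:
        # Convention chosen for this sketch: two empty sets count as identical.
        return 1.0
    return len(selected & gold) / len(union)


# Hypothetical instance: the cue is "werewolf" and candidates are image labels.
gold = {"full_moon", "wolf"}    # candidates the spymaster associated with the cue
guess = {"full_moon", "dog"}    # candidates a solver selected
score = jaccard_index(guess, gold)  # one shared candidate out of three distinct ones
```

Under this reading, a solver is penalized both for missing an associated candidate and for selecting an unrelated one, so partial matches receive partial credit rather than zero.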