We challenge AI models to "demonstrate understanding" of the sophisticated multimodal humor of The New Yorker Caption Contest. Concretely, we develop three carefully circumscribed tasks for which it suffices (but is not necessary) to grasp potentially complex and unexpected relationships between image and caption, and similarly complex and unexpected allusions to the wide varieties of human experience; these are the hallmarks of a New Yorker-caliber cartoon. We investigate vision-and-language models that take as input the cartoon pixels and caption directly, as well as language-only models for which we circumvent image-processing by providing textual descriptions of the image. Even with the rich multifaceted annotations we provide for the cartoon images, we identify performance gaps between high-quality machine learning models (e.g., a fine-tuned, 175B parameter language model) and humans. We publicly release our corpora including annotations describing the image's locations/entities, what's unusual about the scene, and an explanation of the joke.