The past few years have witnessed renewed interest in NLP tasks at the interface between vision and language. One intensively studied problem is that of automatically generating text from images. In this paper, we extend this problem to the more specific domain of face description. Unlike scene descriptions, face descriptions are more fine-grained and rely on attributes extracted from the image, rather than on objects and relations. Given that no data exists for this task, we present an ongoing crowdsourcing study to collect a corpus of descriptions of face images taken "in the wild". To gain a better understanding of the variation we find in face descriptions and the issues that it may raise, we also conducted an annotation study on a subset of the corpus. Our primary finding is that descriptions refer to a mixture of attributes, not only physical but also emotional and inferential, which is bound to create further challenges for current image-to-text methods.