Human language users can generate descriptions of perceptual concepts that go beyond instance-level representations, and can also use such descriptions to learn provisional class-level representations. However, the ability of computational models to learn and operate with class representations is under-investigated in the language-and-vision field. In this paper, we train separate neural networks to generate and interpret class-level descriptions. We then use the zero-shot classification performance of the interpretation model as a measure of communicative success and class-level conceptual grounding. We investigate the performance of prototype- and exemplar-based neural representations for grounded category description. Finally, we show that communicative success reveals performance issues in the generation model that are not captured by traditional intrinsic NLG evaluation metrics, and argue that these issues can be traced to a failure to properly ground language in vision at the class level. We observe that the interpretation model performs better with descriptions that are low in diversity at the class level, possibly indicating a strong reliance on frequently occurring features.
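To make the prototype/exemplar distinction and the zero-shot evaluation setup concrete, the following is a minimal illustrative sketch, not the paper's actual models: it assumes classes are represented by instance embeddings (here random vectors standing in for learned visual or description embeddings), with hypothetical helper names, and shows prototype-based (mean embedding) versus exemplar-based (best-matching instance) scoring, with zero-shot classification by cosine similarity.

```python
import numpy as np


def _cosine(a, b):
    # Cosine similarity between a single vector and one or more vectors.
    a = np.atleast_2d(a)
    return (a @ b) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b) + 1e-9)


def prototype_score(query, class_instances):
    # Prototype view: a class is summarised by the mean of its instance embeddings.
    prototype = class_instances.mean(axis=0)
    return _cosine(prototype, query).item()


def exemplar_score(query, class_instances):
    # Exemplar view: a class is the set of its instances; a query is scored
    # by its best (maximum-similarity) match against any stored exemplar.
    return _cosine(class_instances, query).max()


def zero_shot_classify(query, classes, score_fn):
    # Assign the query embedding to the class whose representation it matches best.
    return max(classes, key=lambda c: score_fn(query, classes[c]))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy "classes": 5 instance embeddings of dimension 8 each, with shifted means.
    classes = {name: rng.normal(size=(5, 8)) + i
               for i, name in enumerate(["cat", "dog"])}
    query = rng.normal(size=8) + 1.0  # lies closer to the "dog" cluster
    print(zero_shot_classify(query, classes, prototype_score))
    print(zero_shot_classify(query, classes, exemplar_score))
```

In the paper's setting, the query would be an embedding of a held-out class description produced by the generation model, and classification accuracy under this scheme serves as the measure of communicative success.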