Recent breakthroughs in NLP research, such as the advent of Transformer models, have indisputably contributed to major advancements in several tasks. However, few works examine the robustness and explainability of their evaluation strategies. In this work, we study the behavior of high-performing pre-trained language models, focusing on the task of semantic similarity for visual vocabularies. First, we address the need for explainable evaluation metrics, which are necessary for understanding the conceptual quality of retrieved instances. Our proposed metrics provide valuable insights at both the local and the global level, exposing the shortcomings of widely used approaches. Second, adversarial interventions on salient query semantics expose vulnerabilities of opaque metrics and reveal patterns in the learned linguistic representations.
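As a minimal sketch of the kind of similarity-based retrieval the abstract refers to, the snippet below ranks a toy vocabulary by cosine similarity to a query embedding. The vocabulary, vectors, and function names here are purely illustrative assumptions, not the paper's actual data or method; in practice the embeddings would come from a pre-trained language model.

```python
import math

def cosine_similarity(u, v):
    # Standard cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical "visual vocabulary" embeddings (toy values for illustration).
vocabulary = {
    "dog": [0.9, 0.1, 0.3],
    "cat": [0.8, 0.2, 0.35],
    "car": [0.1, 0.9, 0.2],
}

def retrieve_most_similar(query_vec, vocab):
    # Return the vocabulary entry whose embedding is closest to the query.
    return max(vocab, key=lambda w: cosine_similarity(query_vec, vocab[w]))
```

An adversarial intervention of the kind the abstract describes would perturb the salient semantics of the query (e.g., swap a key concept) and check whether the retrieval ranking degrades in an interpretable way.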