Progress on many Natural Language Processing (NLP) tasks, such as text classification, is driven by objective, reproducible and scalable evaluation via publicly available benchmarks. However, these are not always representative of real-world scenarios where text classifiers are employed, such as sentiment analysis or misinformation detection. In this position paper, we put forward two points that aim to alleviate this problem. First, we propose to extend text classification benchmarks to evaluate the explainability of text classifiers. We review the challenges associated with objectively evaluating the capability to produce valid explanations, which leads us to the second main point: we propose to ground these benchmarks in human-centred applications, for example by using social media, by gamification, or by learning explainability metrics from human judgements.