Models that top leaderboards often perform unsatisfactorily when deployed in real-world applications; this has necessitated rigorous and expensive pre-deployment model testing. A hitherto unexplored facet of model performance is: Are our leaderboards doing equitable evaluation? In this paper, we introduce a task-agnostic method to probe leaderboards by weighting samples based on their `difficulty' level. We find that leaderboards can be adversarially attacked, and that top-performing models may not always be the best models. We subsequently propose alternate evaluation metrics. Our experiments on 10 models show changes in model ranking and an overall reduction in previously reported performance -- thus rectifying the overestimation of AI systems' capabilities. Inspired by behavioral testing principles, we further develop a prototype of a visual analytics tool that enables leaderboard revamping through customization, based on an end user's focus area. This helps users analyze models' strengths and weaknesses, and guides them in the selection of a model best suited for their application scenario. In a user study, members of various commercial product development teams, covering 5 focus areas, find that our prototype reduces pre-deployment development and testing effort by 41% on average.
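To make the idea of difficulty-based weighting concrete, the following is a minimal illustrative sketch of a difficulty-weighted accuracy; the difficulty score $d_i$ and the weighting function $w(\cdot)$ are assumptions introduced here for exposition, not the paper's exact metric definitions.
\[
\mathrm{Acc}_{w} = \frac{\sum_{i=1}^{N} w(d_i)\,\mathbb{1}[\hat{y}_i = y_i]}{\sum_{i=1}^{N} w(d_i)},
\]
where $d_i$ is the estimated difficulty of sample $i$, $\hat{y}_i$ and $y_i$ are the predicted and gold labels, and $w$ is a monotonically increasing function (e.g., $w(d_i) = d_i$), so that correct predictions on harder samples contribute more to the overall score than correct predictions on easier ones.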