A growing body of research runs human subject evaluations to study whether providing users with explanations of machine learning models can help them with practical real-world use cases. However, running user studies is challenging and costly, and consequently each study typically only evaluates a limited number of different settings, e.g., studies often only evaluate a few arbitrarily selected explanation methods. To address these challenges and aid user study design, we introduce Use-Case-Grounded Simulated Evaluations (SimEvals). SimEvals involve training algorithmic agents that take as input the information content (such as model explanations) that would be presented to each participant in a human subject study, and predict the answer to the use case of interest. The algorithmic agent's test-set accuracy provides a measure of the predictiveness of the information content for the downstream use case. We run a comprehensive evaluation on three real-world use cases (forward simulation, model debugging, and counterfactual reasoning) to demonstrate that SimEvals can effectively identify which explanation methods will help humans for each use case. These results provide evidence that SimEvals can be used to efficiently screen an important set of user study design decisions, e.g., selecting which explanations should be presented to the user, before running a potentially costly user study.
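To make the setup concrete, the following is a minimal Python/scikit-learn sketch of how one such algorithmic agent could be trained and scored for the forward simulation use case. The synthetic data, the importance-weighted explanation, and the logistic-regression agent are illustrative assumptions, not the authors' implementation.

```python
# Minimal SimEval-style sketch (illustrative only; all names and the synthetic
# setup are assumptions, not the paper's implementation).
# Use case: forward simulation -- can an agent predict the model's output for a
# new input when given that input plus an explanation of the model?

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# 1. Train the model to be explained on a synthetic classification task.
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_train, X_eval, y_train, _ = train_test_split(X, y, test_size=0.5, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# 2. Build the "information content" a study participant would see.
#    Here: the input scaled by a global feature-importance vector, standing in
#    for whichever explanation method is being screened.
explanations = X_eval * model.feature_importances_
info_content = np.hstack([X_eval, explanations])

# 3. For forward simulation, the use-case label is the model's own prediction.
use_case_labels = model.predict(X_eval)

# 4. Train the algorithmic agent on (information content, use-case label) pairs;
#    its held-out accuracy is the SimEval score for this study setting.
info_train, info_test, lab_train, lab_test = train_test_split(
    info_content, use_case_labels, test_size=0.3, random_state=0)
agent = LogisticRegression(max_iter=1000).fit(info_train, lab_train)
print("SimEval accuracy:", accuracy_score(lab_test, agent.predict(info_test)))
```

Under this sketch, repeating the procedure with different explanation methods (or with no explanation at all) and comparing the resulting agent accuracies is the screening step the abstract describes: settings whose information content is not predictive for the agent are unlikely to help human participants either.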