Machine Learning (ML) models now inform a wide range of human decisions, but using ``black box'' models carries risks such as relying on spurious correlations or errant data. To address this, researchers have proposed methods for supplementing models with explanations of their predictions. However, robust evaluations of these methods' usefulness in real-world contexts have remained elusive, with experiments tending to rely on simplified settings or proxy tasks. We present an experimental study extending a prior explainable ML evaluation experiment and bringing the setup closer to the deployment setting by relaxing its simplifying assumptions. Our empirical study draws dramatically different conclusions than the prior work, highlighting how seemingly trivial experimental design choices can yield misleading results. Beyond the present experiment, we believe this work holds lessons about the necessity of situating the evaluation of any ML method and choosing appropriate tasks, data, users, and metrics to match the intended deployment contexts.