Testing practices within the machine learning (ML) community have centered on assessing a learned model's predictive performance against a test dataset, often drawn from the same distribution as the training dataset. While recent work on robustness and fairness testing within the ML community has pointed to the importance of testing against distributional shifts, these efforts, too, focus on estimating the likelihood of the model making an error against a reference dataset or distribution. We argue that this view of testing actively discourages researchers and developers from looking into other sources of robustness failures, for instance, corner cases that may have severe undesirable impacts. We draw parallels with decades of work on software testing within software engineering, which focuses on assessing a software system under various stress conditions, including corner cases, rather than solely on average-case behaviour. Finally, we put forth a set of recommendations to broaden the view of machine learning testing into a rigorous practice.