As the use of machine learning in high-impact domains becomes widespread, the importance of evaluating safety has increased. One important aspect of this is evaluating how robust a model is to changes in setting or population, which typically requires applying the model to multiple independent datasets. Since the cost of collecting such datasets is often prohibitive, in this paper we propose a framework for analyzing this type of stability using the available data. We use the original evaluation data to determine distributions under which the algorithm performs poorly, and we estimate the algorithm's performance on this "worst-case" distribution. We consider shifts in user-defined conditional distributions, allowing some distributions to shift while other portions of the data distribution are held fixed. For example, in a healthcare context, this allows us to consider shifts in clinical practice while keeping the patient population fixed. To address the challenges of estimation in complex, high-dimensional distributions, we derive a "debiased" estimator that maintains $\sqrt{N}$-consistency even when machine learning methods with slower convergence rates are used to estimate the nuisance parameters. In experiments on a real medical risk prediction task, we show that this estimator can be used to analyze stability and that it accounts for realistic shifts that could not previously be expressed. The proposed framework allows practitioners to proactively evaluate the safety of their models without requiring additional data collection.
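To make the idea concrete, below is a minimal sketch, not the paper's implementation, of one common formulation of this problem: when the shift is restricted so that the distribution of a user-chosen variable set $Z$ is held fixed, the worst-case risk over subpopulations of mass at least $\alpha$ reduces to a conditional value-at-risk (CVaR) of the conditional expected loss $\mu(Z) = E[\text{loss} \mid Z]$. The function name `worst_case_risk`, its arguments, and the choice of `GradientBoostingRegressor` as the nuisance model are illustrative assumptions; the "one-step" line shows the flavor of a debiasing correction of the kind that restores $\sqrt{N}$-consistency, not the paper's exact estimator.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def worst_case_risk(losses, z, alpha=0.5, n_splits=2, seed=0):
    """Estimate the worst-case risk over subpopulations of mass >= alpha,
    holding the marginal of Z fixed: the CVaR of mu(Z) = E[loss | Z].

    losses: (n,) per-example losses of the model under evaluation.
    z:      (n, d) array of the covariates held fixed under the shift.
    Returns (plug_in, one_step) estimates.
    """
    rng = np.random.RandomState(seed)
    n = len(losses)
    idx = rng.permutation(n)
    mu_hat = np.empty(n)
    # Cross-fitting: estimate the nuisance mu(Z) on held-out folds so the
    # ML nuisance estimate is independent of the points it is evaluated on.
    for fold in np.array_split(idx, n_splits):
        train = np.setdiff1d(idx, fold)
        model = GradientBoostingRegressor().fit(z[train], losses[train])
        mu_hat[fold] = model.predict(z[fold])
    # CVaR via the Rockafellar-Uryasev representation: average conditional
    # loss over the worst alpha-fraction of Z values.
    q = np.quantile(mu_hat, 1 - alpha)
    plug_in = q + np.mean(np.maximum(mu_hat - q, 0.0)) / alpha
    # One-step ("debiased") correction: use the observed loss in place of
    # mu_hat inside the tail term, removing first-order nuisance bias.
    tail = mu_hat >= q
    one_step = q + np.mean(tail * (losses - q)) / alpha
    return plug_in, one_step

# Example with synthetic data: losses are high on the Z > 0.8 subpopulation,
# so with alpha = 0.2 the worst-case risk should be close to 1.
rng = np.random.RandomState(1)
Z = rng.uniform(size=(2000, 1))
losses = (Z[:, 0] > 0.8).astype(float) + rng.normal(scale=0.1, size=2000)
print(worst_case_risk(losses, Z, alpha=0.2))
```

The plug-in estimate inherits the (possibly slow) convergence rate of the nuisance regression, which is why a correction term of the one-step form is needed for valid $\sqrt{N}$-rate inference when flexible ML nuisance estimators are used.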