Conformal prediction is a popular, modern technique for providing valid predictive inference for arbitrary machine learning models. Its validity relies on the assumptions of exchangeability of the data, and symmetry of the given model fitting algorithm as a function of the data. However, exchangeability is often violated when predictive models are deployed in practice. For example, if the data distribution drifts over time, then the data points are no longer exchangeable; moreover, in such settings, we might want to use a nonsymmetric algorithm that treats recent observations as more relevant. This paper generalizes conformal prediction to deal with both aspects: we employ weighted quantiles to introduce robustness against distribution drift, and design a new randomization technique to allow for algorithms that do not treat data points symmetrically. Our new methods are provably robust, with substantially less loss of coverage when exchangeability is violated due to distribution drift or other challenging features of real data, while also achieving the same coverage guarantees as existing conformal prediction methods if the data points are in fact exchangeable. We demonstrate the practical utility of these new tools with simulations and real-data experiments on electricity and election forecasting.
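To make the weighted-quantile idea concrete, the following is a minimal sketch of a split-conformal-style interval in which recent calibration points count for more. The abstract does not specify the score function or the weights, so the absolute-residual score, the geometric `decay` weights, the unit weight placed at +∞ for the test point, and the function names `weighted_quantile` and `weighted_conformal_interval` are all assumptions made for illustration. It does not cover the randomization technique for nonsymmetric algorithms.

```python
import numpy as np


def weighted_quantile(scores, weights, level):
    """Return the smallest score whose normalized cumulative weight reaches `level`.

    A point mass at +inf with weight 1 stands in for the unseen test point
    (an assumed convention), so the result can be +inf when the calibration
    weights carry too little total mass.
    """
    scores = np.asarray(scores, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(scores)
    sorted_scores = np.append(scores[order], np.inf)
    sorted_weights = np.append(weights[order], 1.0)
    cum = np.cumsum(sorted_weights) / sorted_weights.sum()
    idx = np.searchsorted(cum, level, side="left")
    # Guard against floating-point round-off in the last cumulative value.
    return sorted_scores[min(idx, len(sorted_scores) - 1)]


def weighted_conformal_interval(cal_residuals, prediction, alpha=0.1, decay=0.99):
    """Interval `prediction +/- q`, with q a weighted quantile of |residuals|.

    `cal_residuals` is assumed to be ordered from oldest to newest; the
    geometric weights (an illustrative choice) give the newest point weight
    `decay` and the oldest weight `decay ** n`.
    """
    residuals = np.abs(np.asarray(cal_residuals, dtype=float))
    n = len(residuals)
    weights = decay ** np.arange(n, 0, -1)
    q = weighted_quantile(residuals, weights, 1.0 - alpha)
    return prediction - q, prediction + q


# Drift example: old residuals are large, recent residuals are small.
residuals = np.concatenate([np.full(50, 10.0), np.full(50, 1.0)])
unweighted = weighted_conformal_interval(residuals, 0.0, alpha=0.2, decay=1.0)
weighted = weighted_conformal_interval(residuals, 0.0, alpha=0.2, decay=0.9)
```

With `decay=1.0` all weights are equal and the sketch reduces to the usual split conformal quantile. A smaller `decay` tracks recent behaviour more closely, but it also shrinks the effective calibration sample, so the quantile can become infinite if `decay` is too aggressive for the chosen `alpha`.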