In this paper, we aim to estimate the prediction error of machine learning models under the true distribution of the data at hand. We treat the prediction model as a data-driven black-box function and quantify its statistical properties using non-parametric methods. We propose a novel sampling technique that takes advantage of the underlying probability distribution information embedded in the data. The proposed method combines two existing frameworks for estimating the prediction error: $m$ out of $n$ bootstrapping and iterative bootstrapping. $m$ out of $n$ bootstrapping is used to maintain consistency, and iterative bootstrapping is often used for bias correction of the prediction error estimate. Using Monte-Carlo uncertainty quantification techniques, we decompose the total variance of the estimator so that the user can make informed decisions about measures to overcome the preventable errors. In addition, via the same Monte-Carlo framework, we provide a way to estimate the bias due to using the empirical distribution. This bias captures the sensitivity of the estimator to the input data at hand and helps in understanding the robustness of the estimator. The proposed uncertainty quantification is tested in a model selection case study using simulated and real datasets. We evaluate the performance of the proposed estimator in two frameworks: first, applying it directly as an optimization model to find the best model; second, fixing an optimization engine and using the proposed estimator as a fitness function within the optimizer. Furthermore, we compare the asymptotic statistical properties of the proposed estimator, and its numerical results on a finite dataset, with those of existing state-of-the-art methods.
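To make the $m$ out of $n$ idea concrete, the following is a minimal sketch of a plain $m$ out of $n$ bootstrap estimate of prediction error. It is not the estimator proposed in the paper: it omits the iterative (bias-correcting) bootstrap level and the variance decomposition. The function names `fit_least_squares` and `m_out_of_n_bootstrap_error`, the least-squares learner standing in for the black-box model, the squared-error loss, and the out-of-bag evaluation are all assumptions made for illustration.

```python
import numpy as np


def fit_least_squares(X, y):
    """Hypothetical stand-in for a black-box learner: ordinary least squares."""
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)

    def predict(X_new):
        return np.column_stack([np.ones(len(X_new)), X_new]) @ coef

    return predict


def m_out_of_n_bootstrap_error(X, y, m, n_boot=200, fit=fit_least_squares, seed=0):
    """Estimate squared prediction error with an m out of n bootstrap.

    Each replicate draws m of the n observations with replacement, fits the
    model on that resample, and scores it on the observations left out of the
    resample (an assumed evaluation choice, not taken from the paper).
    Returns the mean error across replicates and the Monte-Carlo variance of
    the per-replicate errors.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    all_indices = np.arange(n)
    errors = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=m)          # m draws with replacement
        held_out = np.setdiff1d(all_indices, idx)  # observations not drawn
        if held_out.size == 0:
            continue                               # nothing left to score on
        predict = fit(X[idx], y[idx])
        residuals = y[held_out] - predict(X[held_out])
        errors.append(np.mean(residuals ** 2))
    errors = np.asarray(errors)
    return errors.mean(), errors.var(ddof=1)
```

Calling `m_out_of_n_bootstrap_error(X, y, m=50)` on a dataset of a few hundred rows returns the error estimate together with the spread across resamples. Choosing $m < n$ is what distinguishes this from the standard bootstrap, which uses $m = n$.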