Uncertainty quantification is essential for scientific analysis because it allows the variability and reliability of complex systems and datasets to be evaluated and interpreted. In their original form, multivariate statistical regression models (partial least-squares regression, PLS; principal component regression, PCR) and their kernelized versions (kernel partial least-squares regression, K-PLS; kernel principal component regression, K-PCR) do not provide uncertainty quantification as part of their output. In this study, we propose a method inspired by conformal inference to estimate and calibrate the uncertainty of multivariate statistical models. The method produces a point prediction accompanied by prediction intervals that depend on the input data. We tested the proposed method on both traditional and kernelized versions of PLS and PCR, and demonstrate it using synthetic data as well as laboratory near-infrared (NIR) and airborne hyperspectral regression models for estimating functional plant traits. On the simulated data, the method identified the uncertain regions and matched the magnitude of the uncertainty. In real-case scenarios, the optimized model was neither overconfident nor underconfident when estimating from test data: for example, for a 95% prediction interval, 95% of the true observations fell inside the prediction interval.
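To make the idea of a calibrated prediction interval concrete, the following is a minimal sketch of standard split conformal prediction, which is not necessarily the procedure proposed in this study. Several things here are assumptions made purely for illustration: a one-variable least-squares line stands in for PLS/PCR, the data are synthetic with constant noise, and the helper names (`fit_line`, `conformal_quantile`, `make_data`) are hypothetical. Unlike the input-dependent intervals described above, this basic variant yields intervals of constant width.

```python
import math
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x (an illustrative stand-in for PLS/PCR)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    return a, b


def conformal_quantile(scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of the calibration scores."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    # Strictly, the interval is unbounded when k > n; clamped here for simplicity.
    k = min(k, n)
    return sorted(scores)[k - 1]


def make_data(n, rng):
    """Synthetic data (assumed): linear trend plus unit-variance Gaussian noise."""
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    ys = [2.0 + 0.5 * x + rng.gauss(0.0, 1.0) for x in xs]
    return xs, ys


rng = random.Random(0)
x_train, y_train = make_data(500, rng)   # used to fit the model
x_cal, y_cal = make_data(1000, rng)      # held out to calibrate the interval
x_test, y_test = make_data(2000, rng)    # used only to check coverage

a, b = fit_line(x_train, y_train)


def predict(x):
    return a + b * x


alpha = 0.05  # target a 95% prediction interval

# Nonconformity score: absolute residual on the calibration set.
scores = [abs(y - predict(x)) for x, y in zip(x_cal, y_cal)]
q = conformal_quantile(scores, alpha)

# The interval for a new x is [predict(x) - q, predict(x) + q].
covered = sum(1 for x, y in zip(x_test, y_test) if abs(y - predict(x)) <= q)
coverage = covered / len(x_test)
print(f"half-width = {q:.3f}, empirical coverage = {coverage:.3f}")
```

A model that is neither overconfident nor underconfident should give an empirical coverage close to the nominal 95% on held-out data. Input-dependent intervals would require a different nonconformity score, for example residuals scaled by a local estimate of spread, which this sketch does not attempt.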