Cross-validation (CV) is a standard technique used across science to test how well a model predicts new data. Data are split into $K$ ``folds,'' one of which (the hold-out set) is used to evaluate the model's predictive ability; in standard $K$-fold CV the folds are cycled so that each serves once as the hold-out set. Researchers typically rely on convention when choosing the hold-out size, commonly an $80/20$ split or $K=5$, even though this choice can affect inference and model evaluation. In principle, the split should be determined by balancing predictive accuracy (bias) against the uncertainty of that accuracy (variance), a tradeoff governed by the size of the hold-out set: more training data yields a more accurate model, but less testing data yields a more uncertain evaluation, and vice versa. The challenge is that this evaluation uncertainty cannot be identified directly from the data without strong assumptions. We propose a procedure for determining the optimal hold-out size by deriving a finite-sample exact expression and an upper bound on the evaluation uncertainty, depending on the error assumption, and adopting a utility-based approach that makes the tradeoff explicit. Analyses of real-world datasets using linear regression and random forests demonstrate the procedure in practice, providing insight into implicit assumptions, robustness, and model performance. Critically, the results show that the optimal hold-out size depends on both the data and the model, and that conventional choices implicitly make assumptions about fundamental characteristics of the data. Our framework makes these assumptions explicit and provides a principled, transparent way to select the split based on the data and model rather than convention. By replacing a one-size-fits-all choice with context-specific reasoning, it enables more reliable comparisons of predictive performance across scientific domains.
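To illustrate the tradeoff empirically, the sketch below (a minimal illustration only, not the derived procedure of this paper; it assumes Python with scikit-learn and a synthetic dataset generated by make_regression) refits a linear regression and a random forest at several hold-out fractions and reports the mean hold-out mean-squared error together with its spread across random splits, the latter serving only as a crude proxy for the evaluation uncertainty.

\begin{verbatim}
# Illustrative sketch: vary the hold-out fraction and observe how the
# estimated prediction error and its spread across random splits change.
# Assumes scikit-learn and a synthetic regression dataset; not the
# paper's finite-sample procedure.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import ShuffleSplit

X, y = make_regression(n_samples=500, n_features=10, noise=10.0,
                       random_state=0)

models = {
    "linear regression": LinearRegression(),
    "random forest": RandomForestRegressor(n_estimators=100,
                                           random_state=0),
}

for name, model in models.items():
    print(name)
    for test_frac in (0.1, 0.2, 0.3, 0.5):
        # 50 random splits at this hold-out fraction
        cv = ShuffleSplit(n_splits=50, test_size=test_frac,
                          random_state=0)
        errors = []
        for train_idx, test_idx in cv.split(X):
            model.fit(X[train_idx], y[train_idx])
            pred = model.predict(X[test_idx])
            errors.append(mean_squared_error(y[test_idx], pred))
        # Mean hold-out MSE (accuracy) and its spread across splits
        # (a rough empirical proxy for evaluation uncertainty).
        print(f"  test fraction {test_frac:.1f}: "
              f"MSE {np.mean(errors):8.1f} +/- {np.std(errors):6.1f}")
\end{verbatim}

Larger hold-out fractions typically shrink the split-to-split spread of the error estimate while degrading the fitted model, which is the bias--variance tension that the proposed utility-based procedure resolves explicitly.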