In this article, we propose SPlit, an optimal method for splitting a dataset into training and testing sets. SPlit is based on the method of support points (SP), which was originally developed to find the optimal representative points of a continuous distribution. We adapt SP to subsample from a dataset using a sequential nearest-neighbor algorithm. We also extend SP to handle categorical variables, so that SPlit can be applied to both regression and classification problems. Applying SPlit to real datasets shows a substantial improvement in worst-case testing performance for several modeling methods compared with the commonly used random splitting procedure.
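The subsampling step can be illustrated with a minimal sketch. It assumes one plausible reading of "sequential nearest neighbor": the representative points are processed in order, and each one claims the closest data row that has not already been claimed, so the result is a subsample without replacement. The function names are hypothetical, and the representative points in the demo are random stand-ins rather than true support points, which would have to be computed separately.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length numeric tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def sequential_nearest_neighbor(data, points):
    """Match each representative point to a distinct row of `data`.

    Points are handled one at a time; each takes the nearest row that is
    still available. Returns (subsample indices, remaining indices).
    Assumption: this greedy matching is a sketch, not the paper's exact code.
    """
    if len(points) > len(data):
        raise ValueError("more representative points than data rows")
    available = set(range(len(data)))
    chosen = []
    for p in points:
        # Break distance ties by row index so the result is deterministic.
        idx = min(available, key=lambda i: (squared_distance(data[i], p), i))
        available.remove(idx)
        chosen.append(idx)
    return chosen, sorted(available)


# Demo with stand-in points (NOT support points): 100 rows, 20 points.
rng = random.Random(0)
data = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(100)]
points = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(20)]
subsample_idx, rest_idx = sequential_nearest_neighbor(data, points)
print(len(subsample_idx), len(rest_idx))  # 20 80
```

The two index lists are disjoint and together cover every row, so they can serve as the two sides of a split. This brute-force version costs one pass over the remaining rows per point; a spatial index would be the natural replacement for large datasets.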