We propose Similarity-Based Stratified Splitting (SBSS), a technique that uses both output-space and input-space information to split the data. The splits are generated by applying similarity functions among samples so that similar samples are placed in different splits. This allows the data to be better represented in the training phase and leads to a more realistic performance estimate for real-world applications. We evaluate our proposal on twenty-two benchmark datasets with classifiers such as Multi-Layer Perceptron, Support Vector Machine, Random Forest, and K-Nearest Neighbors, and with five similarity functions: Cityblock, Chebyshev, Cosine, Correlation, and Euclidean. According to the Wilcoxon signed-rank test, our approach consistently outperformed ordinary stratified 10-fold cross-validation in 75% of the assessed scenarios.
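The idea of placing similar samples in different splits while keeping class proportions can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `similarity_stratified_folds`, the pivot-and-nearest-neighbours grouping, and the round-robin assignment are all assumptions made for illustration, and only two of the five distance functions named above are written out.

```python
import math
from collections import defaultdict


def cityblock(a, b):
    """Manhattan (L1) distance between two equal-length feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def euclidean(a, b):
    """Euclidean (L2) distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def similarity_stratified_folds(X, y, k, dist=euclidean):
    """Return k lists of sample indices (hypothetical helper, illustrative only).

    Within each class, a pivot and its k - 1 nearest remaining neighbours
    are dealt to k different folds, so similar samples land in different
    splits while every fold keeps roughly the same class proportions.
    """
    folds = [[] for _ in range(k)]
    by_class = defaultdict(list)
    for idx, label in enumerate(y):
        by_class[label].append(idx)

    next_fold = 0  # keeps rotating across classes so fold sizes stay balanced
    for label in sorted(by_class):
        remaining = by_class[label]
        while remaining:
            pivot = remaining[0]
            # Order the remaining samples by distance to the pivot;
            # ties are broken by index to keep the result deterministic.
            remaining.sort(key=lambda i: (dist(X[pivot], X[i]), i))
            group, remaining = remaining[:k], remaining[k:]
            for idx in group:
                folds[next_fold].append(idx)
                next_fold = (next_fold + 1) % k
    return folds


# Toy data: four near-duplicate pairs, two pairs per class.
X = [[0.0], [0.1], [1.0], [1.1], [10.0], [10.1], [11.0], [11.1]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
folds = similarity_stratified_folds(X, y, k=2)
print(folds)  # [[0, 2, 4, 6], [1, 3, 5, 7]]
```

In this toy run each near-duplicate pair is separated across the two folds, and each fold holds two samples of each class. Passing `dist=cityblock` swaps the distance function without changing the rest of the procedure.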