The growing volume of data makes computationally intensive machine learning techniques, such as symbolic regression with genetic programming, increasingly impractical. This work discusses methods to reduce the training data and thereby also the runtime of genetic programming. The data is aggregated in a preprocessing step before running the actual machine learning algorithm. K-means clustering and data binning are used for data aggregation and compared with random sampling as the simplest data reduction method. We analyze the achieved speed-up in training and the effects on the trained models' test accuracy for every method on four real-world data sets. The performance of genetic programming is compared with random forests and linear regression. It is shown that k-means clustering and random sampling lead to only a very small loss in test accuracy when the data is reduced down to 30% of the original data, while the speed-up is proportional to the size of the data set. Binning, on the contrary, leads to models with very high test error.
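To make the preprocessing step concrete, the following is a minimal sketch of how such data aggregation could be implemented before training. It assumes scikit-learn's KMeans and NumPy; the function names, the 30% reduction fraction, and the choice to pair each centroid with the mean target of its cluster are illustrative assumptions, not the exact procedure used in the experiments.

```python
import numpy as np
from sklearn.cluster import KMeans

def reduce_by_kmeans(X, y, fraction=0.3, random_state=0):
    """Aggregate (X, y) into k cluster prototypes, with k = fraction * n_samples.

    Each reduced sample is a cluster centroid paired with the mean target
    value of the points assigned to that cluster (an illustrative choice).
    """
    k = max(1, int(fraction * len(X)))
    km = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X)
    X_reduced = km.cluster_centers_
    y_reduced = np.array([y[km.labels_ == c].mean() for c in range(k)])
    return X_reduced, y_reduced

def reduce_by_random_sampling(X, y, fraction=0.3, random_state=0):
    """Baseline: keep a uniformly random subset of the original rows."""
    rng = np.random.default_rng(random_state)
    idx = rng.choice(len(X), size=max(1, int(fraction * len(X))), replace=False)
    return X[idx], y[idx]
```

The reduced data set returned by either function would then be passed to the genetic programming run (or to random forests and linear regression for comparison) in place of the full training data.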