Machine learning is often applied to obtain predictions and new insights into complex phenomena and relationships, but the availability of sufficient data for model training is a widespread problem. Traditional machine learning techniques, such as random forests and gradient boosting, tend to overfit on data sets of only a few hundred observations. This study demonstrates that for small training sets of 250 observations, symbolic regression generalises better to out-of-sample data than traditional machine learning frameworks, as measured by the coefficient of determination $R^2$ on the validation set. In 132 out of 240 cases, symbolic regression achieves a higher $R^2$ on the out-of-sample data than any of the other models. Beyond its superior generalisation, symbolic regression also preserves the interpretability of linear models and decision trees. The second-best algorithm was found to be a random forest, which performs best in 37 of the 240 cases. When the comparison is restricted to interpretable models, symbolic regression performs best in 184 out of 240 cases.
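The comparison above ranks models by the coefficient of determination $R^2$ on held-out validation data. As a point of reference (not the study's own code), a minimal sketch of that metric, assuming only NumPy is available:

```python
import numpy as np

def r2_score(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Out-of-sample R^2: 1 minus the ratio of residual to total sum of squares.

    Equals 1.0 for perfect predictions; can be negative when the model
    predicts worse than the mean of the validation targets.
    """
    ss_res = np.sum((y_true - y_pred) ** 2)           # residual sum of squares
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot

# Illustrative validation-set targets and predictions (hypothetical values):
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 7.5, 9.0])
print(r2_score(y_true, y_pred))  # → 0.975
```

Under this metric, "performs best" in the abstract means attaining the highest such score among the compared models on the same validation set.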