Machine learning is often applied to obtain predictions and new insights into complex phenomena and relationships, but the availability of sufficient data for model training is a widespread problem. Traditional machine learning techniques, such as random forests and gradient boosting, tend to overfit on data sets of only a few hundred observations. This study demonstrates that for small training sets of 250 observations, symbolic regression generalises better to out-of-sample data than traditional machine learning frameworks, as measured by the coefficient of determination $R^2$ on the validation set. In 132 out of 240 cases, symbolic regression achieves a higher $R^2$ on the out-of-sample data than any of the other models. Beyond its superior generalisation, symbolic regression also preserves the interpretability of linear models and decision trees. The second-best algorithm was found to be a random forest, which performs best in 37 of the 240 cases. When the comparison is restricted to interpretable models, symbolic regression performs best in 184 out of 240 cases.
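The comparison above ranks models by the coefficient of determination $R^2$ on held-out validation data. As a point of reference (not the study's own code), a minimal sketch of that metric, assuming only NumPy is available:

```python
import numpy as np

def r2_score(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Out-of-sample R^2: 1 minus the ratio of residual to total sum of squares.

    Equals 1.0 for perfect predictions; can be negative when the model
    predicts worse than the mean of the validation targets.
    """
    ss_res = np.sum((y_true - y_pred) ** 2)           # residual sum of squares
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot

# Illustrative validation-set targets and predictions (hypothetical values):
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 7.5, 9.0])
print(r2_score(y_true, y_pred))  # → 0.975
```

Under this metric, "performs best" in the abstract means attaining the highest such score among the compared models on the same validation set.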