Many promising approaches to symbolic regression have been presented in recent years, yet progress in the field continues to suffer from a lack of uniform, robust, and transparent benchmarking standards. In this paper, we address this shortcoming by introducing an open-source, reproducible benchmarking platform for symbolic regression. We assess 14 symbolic regression methods and 7 machine learning methods on a set of 252 diverse regression problems. Our assessment includes both real-world datasets with no known model form and ground-truth benchmark problems, including physics equations and systems of ordinary differential equations. For the real-world datasets, we benchmark the ability of each method to learn models with low error and low complexity relative to state-of-the-art machine learning methods. For the synthetic problems, we assess each method's ability to find exact solutions in the presence of varying levels of noise. From these controlled experiments, we conclude that the best-performing methods for real-world regression combine genetic algorithms with parameter estimation and/or semantic search drivers. When tasked with recovering exact equations in the presence of noise, we find that deep learning and genetic algorithm-based approaches perform similarly. We provide a detailed guide to reproducing this experiment and contributing new methods, and encourage other researchers to collaborate with us on a common and living symbolic regression benchmark.