Differentiable Genetic Programming for High-dimensional Symbolic Regression

Symbolic regression (SR) is the process of discovering hidden relationships in data in the form of mathematical expressions, and is considered an effective route to interpretable machine learning (ML). Genetic programming (GP) has been the dominant approach to SR problems. However, as the scale of SR problems increases, GP often performs poorly and cannot effectively address real-world high-dimensional problems. This limitation is mainly caused by the stochastic evolutionary nature of traditional GP in constructing trees. In this paper, we propose a differentiable approach named DGP that, for the first time, constructs GP trees for high-dimensional SR. Specifically, a new data structure called the differentiable symbolic tree is proposed to relax the discrete tree structure into a continuous one, so that a gradient-based optimizer can be used for efficient optimization. In addition, a sampling method is proposed to eliminate the discrepancy caused by this relaxation and to yield valid symbolic expressions. Furthermore, a diversification mechanism is introduced to help the optimizer escape local optima and reach globally better solutions. With these designs, the proposed DGP method can efficiently search for GP trees with higher performance, and is thus capable of dealing with high-dimensional SR. To demonstrate the effectiveness of DGP, we conducted various experiments against state-of-the-art methods based on both GP and deep neural networks. The experimental results show that DGP outperforms these peer competitors on high-dimensional regression benchmarks with dimensions ranging from tens to thousands. In addition, on synthetic SR problems, the proposed DGP method achieves the best recovery rate even under different noise levels. We believe this work can help SR become a powerful alternative for interpretable ML across a broader range of real-world problems.
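
To make the idea of relaxing a discrete tree structure concrete, the following is a minimal sketch of a generic continuous relaxation for a single tree node. It is not the DGP data structure itself, whose details are not given in this abstract. The operator set `OPS`, the helper names `relaxed_node`, `discretize`, and `fit_logits`, and the use of a softmax mixture followed by an argmax are all assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical operator set for one internal node (an assumption, not from the paper).
OPS = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
}


def softmax(z):
    """Numerically stable softmax over a 1-D array of logits."""
    e = np.exp(z - np.max(z))
    return e / e.sum()


def relaxed_node(logits, a, b):
    """Continuous relaxation: a softmax-weighted mixture of all candidate operators."""
    w = softmax(logits)
    outs = np.stack([f(a, b) for f in OPS.values()])  # shape (n_ops, n_samples)
    return w @ outs


def discretize(logits):
    """Recover a valid discrete choice by keeping the most probable operator."""
    return list(OPS)[int(np.argmax(logits))]


def fit_logits(a, b, y, steps=200, lr=0.5):
    """Plain gradient descent on the mean squared error with respect to the logits."""
    logits = np.zeros(len(OPS))
    outs = np.stack([f(a, b) for f in OPS.values()])
    for _ in range(steps):
        w = softmax(logits)
        resid = w @ outs - y                   # prediction error per sample
        grad_w = 2.0 * outs @ resid / y.size   # gradient of the loss w.r.t. mixture weights
        grad_z = w * (grad_w - w @ grad_w)     # chain rule through the softmax
        logits -= lr * grad_z
    return logits


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a, b = rng.standard_normal(256), rng.standard_normal(256)
    learned = fit_logits(a, b, a * b)          # target is generated by the "mul" operator
    print(discretize(learned))
```

In this sketch the operator choice becomes a vector of real-valued logits, so the fit can follow gradients instead of relying on random mutation. The final argmax step stands in for the kind of discretization that the abstract's sampling method addresses, since the mixture itself is not a valid symbolic expression.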
