Identifying the mathematical relationships that best describe a dataset remains a challenging problem in machine learning, known as Symbolic Regression (SR). In contrast to neural networks, which are often treated as black boxes, SR attempts to gain insight into the underlying relationships between the independent variables and the target variable of a given dataset by assembling analytical functions. In this paper, we present GSR, a Generalized Symbolic Regression approach that modifies the conventional SR optimization problem formulation while keeping the main SR objective intact. In GSR, we infer mathematical relationships between the independent variables and some transformation of the target variable. We constrain our search space to a weighted sum of basis functions and propose a genetic programming approach with a matrix-based encoding scheme. We show that our GSR method outperforms several state-of-the-art methods on well-known SR benchmark problem sets. Finally, we highlight the strengths of GSR by introducing SymSet, a new SR benchmark set that is more challenging than the existing benchmarks.
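To make the generalized formulation concrete, the following is a minimal sketch of fitting a weighted sum of basis functions to a transformation of the target rather than to the target itself. It is an illustration under stated assumptions, not the GSR method: the basis functions are fixed by hand instead of being found by genetic programming, the transformation is assumed to be the logarithm, the weights come from ordinary least squares, and the helper name `fit_weighted_basis` and the synthetic data are hypothetical.

```python
import numpy as np


def fit_weighted_basis(X, z, basis):
    """Least-squares weights w such that z ~ sum_j w[j] * basis[j](X).

    Hypothetical helper for illustration; not taken from the paper.
    """
    # One column per basis function, one row per sample.
    Phi = np.column_stack([phi(X) for phi in basis])
    w, *_ = np.linalg.lstsq(Phi, z, rcond=None)
    return w, Phi


# Synthetic data (an assumption for this sketch): y = exp(2*x0 + 0.5*x1^2).
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 2))
y = np.exp(2.0 * X[:, 0] + 0.5 * X[:, 1] ** 2)

# Hand-picked basis functions: constant, x0, and x1 squared.
basis = [
    lambda X: np.ones(len(X)),
    lambda X: X[:, 0],
    lambda X: X[:, 1] ** 2,
]

# Conventional formulation: regress y directly on the basis.
w_plain, Phi = fit_weighted_basis(X, y, basis)
err_plain = np.max(np.abs(Phi @ w_plain - y))

# Generalized formulation: regress a transformation of y (here log y).
w_log, _ = fit_weighted_basis(X, np.log(y), basis)
err_log = np.max(np.abs(Phi @ w_log - np.log(y)))

print("weights for log(y):", w_log)
print("max error, direct fit:", err_plain)
print("max error, transformed fit:", err_log)
```

In this toy setting the same three basis functions cannot represent y exactly, but they represent log(y) exactly, so the transformed fit recovers a compact expression that the direct fit misses. That is the motivation for searching over a transformation of the target as well as over the basis functions.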