When choosing between competing symbolic models for a dataset, a human will naturally prefer the "simpler" expression, or the one which more closely resembles equations previously seen in a similar context. This suggests a non-uniform prior on functions, which is, however, rarely considered within a symbolic regression (SR) framework. In this paper we develop methods to incorporate detailed prior information on both functions and their parameters into SR. Our prior on the structure of a function is based on an $n$-gram language model, which is sensitive to the arrangement of operators relative to one another in addition to the frequency of occurrence of each operator. We also develop a formalism based on the Fractional Bayes Factor to treat numerical parameter priors in such a way that models may be fairly compared through the Bayesian evidence, and explicitly compare Bayesian, Minimum Description Length and heuristic methods for model selection. We demonstrate the performance of our priors relative to literature standards on benchmarks and a real-world dataset from the field of cosmology.
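To make the structural prior concrete, the following is a minimal sketch of an $n$-gram (here, bigram) language model over operator sequences, assuming expressions are tokenized in prefix notation and using add-one smoothing. The toy corpus, tokenization, and smoothing choice are illustrative assumptions, not the paper's actual setup.

```python
from collections import defaultdict
import math

# Hypothetical training corpus of expressions in prefix notation;
# the paper's actual corpus and tokenization may differ.
corpus = [
    ["+", "*", "x", "x", "sin", "x"],
    ["*", "x", "exp", "x"],
    ["+", "sin", "x", "cos", "x"],
]

# Count bigrams (previous token -> next token), with a start symbol.
bigram = defaultdict(lambda: defaultdict(int))
unigram = defaultdict(int)
for expr in corpus:
    prev = "<s>"
    for tok in expr:
        bigram[prev][tok] += 1
        unigram[prev] += 1
        prev = tok

vocab = {t for expr in corpus for t in expr} | {"<s>"}

def log_prior(expr):
    """Add-one-smoothed bigram log-probability of an operator sequence."""
    prev, lp = "<s>", 0.0
    for tok in expr:
        num = bigram[prev][tok] + 1
        den = unigram[prev] + len(vocab)
        lp += math.log(num / den)
        prev = tok
    return lp

# A sequence whose operator arrangement resembles the corpus scores
# higher than one built from rarely adjacent operators.
print(log_prior(["+", "sin", "x", "sin", "x"]) >
      log_prior(["exp", "exp", "exp", "x"]))  # True
```

Because the prior conditions each operator on its predecessor, it penalizes unusual arrangements (e.g. nested `exp(exp(...))`) and not merely rare operators, which is the sense in which an $n$-gram prior is sensitive to operator context.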