Symbolic Regression (SR) can generate interpretable, concise expressions that fit a given dataset, allowing for greater human understanding of the underlying structure than black-box approaches. The addition of background knowledge (in the form of symbolic mathematical constraints) allows for the generation of expressions that are meaningful with respect to theory while also being consistent with data. We specifically examine the addition of constraints to traditional genetic algorithm (GA) based SR (PySR) as well as a Markov-chain Monte Carlo (MCMC) based Bayesian SR architecture (Bayesian Machine Scientist), and apply these to rediscovering adsorption equations from historical experimental datasets. We find that, while hard constraints prevent GA and MCMC SR from searching, soft constraints can lead to improved performance both in terms of search effectiveness and model meaningfulness, with computational costs increasing by about an order of magnitude. If the constraints do not correlate well with the dataset or expected models, they can hinder the search for expressions. We find that incorporating these constraints in Bayesian SR (as the Bayesian prior) works better than modifying the fitness function in the GA.
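To make the hard versus soft distinction concrete, the following is a minimal sketch and not the implementation used with PySR or the Bayesian Machine Scientist. It assumes two illustrative constraints on a candidate isotherm q(p) (zero loading at zero pressure, and loading that never decreases with pressure), a simple additive penalty with a hypothetical `weight`, and synthetic data generated from a Langmuir form. All function names and parameter values are assumptions made for illustration.

```python
import math

# Two hypothetical candidate expressions for an adsorption isotherm q(p).
def langmuir(p, q_max=1.0, k=2.0):
    return q_max * k * p / (1.0 + k * p)

def offset_linear(p, a=0.3, b=0.1):
    # Violates q(0) = 0 because of the constant offset b.
    return a * p + b

def mse(model, pressures, loadings):
    """Mean squared error of the candidate against the data."""
    residuals = [(model(p) - q) ** 2 for p, q in zip(pressures, loadings)]
    return sum(residuals) / len(residuals)

def violation(model, grid):
    """Non-negative measure of how badly the assumed constraints are broken."""
    zero_term = abs(model(0.0))  # assumed constraint: q(0) = 0
    values = [model(p) for p in grid]
    # assumed constraint: q is non-decreasing on the (sorted) grid
    mono_term = sum(max(0.0, a - b) for a, b in zip(values, values[1:]))
    return zero_term + mono_term

def soft_fitness(model, pressures, loadings, grid, weight=1.0):
    """Soft constraint: violations are penalized but the candidate stays rankable."""
    return mse(model, pressures, loadings) + weight * violation(model, grid)

def hard_fitness(model, pressures, loadings, grid, tol=1e-12):
    """Hard constraint: any violation rejects the candidate outright."""
    if violation(model, grid) > tol:
        return math.inf
    return mse(model, pressures, loadings)

# Synthetic data (an assumption, not an experimental dataset).
PRESSURES = [0.1, 0.5, 1.0, 2.0, 4.0]
LOADINGS = [langmuir(p) for p in PRESSURES]
GRID = [0.0, 0.5, 1.0, 2.0, 4.0, 8.0]

for candidate in (langmuir, offset_linear):
    print(candidate.__name__,
          soft_fitness(candidate, PRESSURES, LOADINGS, GRID),
          hard_fitness(candidate, PRESSURES, LOADINGS, GRID))
```

Under the hard variant, every violating candidate receives the same infinite score, so the search gets no signal about which violating expressions are closer to feasibility. The soft variant keeps such candidates comparable, at the price of evaluating the constraints for every candidate.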