Symbolic regression (SR) is a challenging machine learning task that involves recovering a mathematical expression for a function from observations of its values. Recent advances in SR have demonstrated the efficacy of pretrained transformer-based models that generate equations as token sequences; these models benefit from large-scale pretraining on synthetic datasets and offer considerable inference-time advantages over genetic programming (GP)-based methods. However, they focus on supervised pretraining objectives borrowed from text generation and ignore equation-specific objectives such as fitting accuracy and complexity. To address this, we propose TPSR, a Transformer-based Planning strategy for Symbolic Regression that incorporates Monte Carlo Tree Search (MCTS) into the transformer decoding process. Unlike conventional decoding strategies, TPSR allows non-differentiable feedback, such as fitting accuracy and expression complexity, to be integrated as an external source of knowledge into the equation generation process. Extensive experiments on various datasets show that our approach outperforms state-of-the-art methods, enhancing the model's fitting-complexity trade-off, extrapolation abilities, and robustness to noise. We also demonstrate that various caching mechanisms can further improve the efficiency of TPSR.
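To make the idea of non-differentiable feedback concrete, the sketch below shows a toy reward that blends fitting accuracy with a complexity penalty, of the kind a planner could use to score candidate equations during decoding. This is a minimal illustration, not the paper's exact formulation: the functional form, the `lam` weight, and the normalization constant `30.0` are hypothetical choices.

```python
import math

def equation_reward(y_true, y_pred, complexity, lam=0.1):
    """Toy reward for a candidate equation: high when the fit is good
    (low normalized MSE) and the expression is short.
    `lam` and the 30.0 scale are illustrative, not from the paper."""
    n = len(y_true)
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n
    mean = sum(y_true) / n
    var = sum((t - mean) ** 2 for t in y_true) / n
    nmse = mse / (var + 1e-12)              # normalize by target variance
    fit_term = 1.0 / (1.0 + nmse)           # in (0, 1], 1 means perfect fit
    penalty = lam * math.exp(-complexity / 30.0)  # larger for shorter equations
    return fit_term + penalty
```

Because this reward is a plain scalar function of the decoded sequence, it requires no gradients, which is what lets search-based decoding such as MCTS use it where standard maximum-likelihood training cannot.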