Interpretable Machine Learning for Science with PySR and SymbolicRegression.jl

From arXiv: 24 pages, 5 figures, 3 tables. Feedback welcome. Paper source: https://github.com/MilesCranmer/pysr_paper ; PySR: https://github.com/MilesCranmer/PySR ; SymbolicRegression.jl: https://github.com/MilesCranmer/SymbolicRegression.jl

PySR is an open-source library for practical symbolic regression, a type of machine learning which aims to discover human-interpretable symbolic models. PySR was developed to democratize and popularize symbolic regression for the sciences, and is built on a high-performance distributed back-end, a flexible search algorithm, and interfaces with several deep learning packages. PySR's internal search algorithm is a multi-population evolutionary algorithm, which consists of a unique evolve-simplify-optimize loop, designed for optimization of unknown scalar constants in newly-discovered empirical expressions. PySR's backend is the extremely optimized Julia library SymbolicRegression.jl, which can be used directly from Julia. It is capable of fusing user-defined operators into SIMD kernels at runtime, performing automatic differentiation, and distributing populations of expressions to thousands of cores across a cluster. In describing this software, we also introduce a new benchmark, "EmpiricalBench," to quantify the applicability of symbolic regression algorithms in science. This benchmark measures recovery of historical empirical equations from original and synthetic datasets.
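As a rough illustration of the evolve-simplify-optimize loop over multiple populations mentioned in the abstract, here is a toy sketch in pure Python. It is not the PySR or SymbolicRegression.jl implementation. The tuple-based expression trees, the function names (`mutate`, `simplify`, `optimize`, `search`), the two-operator set, the size cap, the migration rule, and the random hill-climbing used to tune constants are all assumptions made for this example.

```python
import random

# Hypothetical operator set for this sketch only.
OPS = {"+": lambda a, b: a + b, "*": lambda a, b: a * b}

def evaluate(expr, x):
    """Evaluate a tree: ("x",), ("c", value), or (op, left, right)."""
    if expr[0] == "x":
        return x
    if expr[0] == "c":
        return expr[1]
    return OPS[expr[0]](evaluate(expr[1], x), evaluate(expr[2], x))

def loss(expr, data):
    """Mean squared error over (x, y) pairs."""
    errors = [evaluate(expr, x) - y for x, y in data]
    return sum(e * e for e in errors) / len(data)

def size(expr):
    return 1 if expr[0] not in OPS else 1 + size(expr[1]) + size(expr[2])

def random_leaf(rng):
    return ("x",) if rng.random() < 0.5 else ("c", rng.uniform(-2.0, 2.0))

def mutate(expr, rng):
    """Evolve step: descend into a child, replace a subtree, or grow a node."""
    if expr[0] in OPS and rng.random() < 0.7:
        if rng.random() < 0.5:
            return (expr[0], mutate(expr[1], rng), expr[2])
        return (expr[0], expr[1], mutate(expr[2], rng))
    if rng.random() < 0.5:
        return random_leaf(rng)
    return (rng.choice(list(OPS)), expr, random_leaf(rng))

def simplify(expr):
    """Simplify step: fold operator nodes whose children are both constants."""
    if expr[0] not in OPS:
        return expr
    left, right = simplify(expr[1]), simplify(expr[2])
    if left[0] == "c" and right[0] == "c":
        return ("c", OPS[expr[0]](left[1], right[1]))
    return (expr[0], left, right)

def constants(expr):
    """Collect constant values in left-to-right order."""
    if expr[0] == "c":
        return [expr[1]]
    if expr[0] == "x":
        return []
    return constants(expr[1]) + constants(expr[2])

def with_constants(expr, values):
    """Rebuild the tree, taking new constants from an iterator in the same order."""
    if expr[0] == "c":
        return ("c", next(values))
    if expr[0] == "x":
        return expr
    left = with_constants(expr[1], values)
    right = with_constants(expr[2], values)
    return (expr[0], left, right)

def optimize(expr, data, rng, steps=20):
    """Optimize step: hill climbing on the constants (a stand-in optimizer)."""
    best, best_loss = expr, loss(expr, data)
    if not constants(best):
        return best
    for _ in range(steps):
        trial_values = [c + rng.gauss(0.0, 0.3) for c in constants(best)]
        trial = with_constants(best, iter(trial_values))
        trial_loss = loss(trial, data)
        if trial_loss < best_loss:  # accept only improvements
            best, best_loss = trial, trial_loss
    return best

def search(data, n_populations=3, pop_size=20, n_iterations=10, seed=0):
    rng = random.Random(seed)
    pops = [[random_leaf(rng) for _ in range(pop_size)]
            for _ in range(n_populations)]
    for _ in range(n_iterations):
        for pop in pops:
            # Evolve: 3-way tournament; a mutated winner replaces the loser.
            for _ in range(pop_size):
                picks = rng.sample(range(pop_size), 3)
                picks.sort(key=lambda i: loss(pop[i], data))
                child = mutate(pop[picks[0]], rng)
                if size(child) <= 15:  # cap expression complexity
                    pop[picks[-1]] = child
            # Simplify, then optimize the constants of every member.
            pop[:] = [optimize(simplify(e), data, rng) for e in pop]
        # Migration: copy the best expression found so far into each population.
        best = min((e for p in pops for e in p), key=lambda e: loss(e, data))
        for pop in pops:
            pop[rng.randrange(pop_size)] = best
    return min((e for p in pops for e in p), key=lambda e: loss(e, data))

# Synthetic data from y = 2*x^2 + 1 (chosen arbitrarily for the demo).
data = [(x / 2.0, 2.0 * (x / 2.0) ** 2 + 1.0) for x in range(-4, 5)]
best = search(data)
print(best, loss(best, data))
```

The sketch keeps the three stages as separate functions so that each can be swapped out independently; for instance, the hill-climbing `optimize` could be replaced with a gradient-based optimizer without touching the mutation or simplification code. Nothing here reflects the performance features described in the abstract (SIMD kernel fusion, automatic differentiation, distribution across a cluster).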

