EDC: Equation Discovery for Classification

From arXiv. This preprint has not undergone peer review or any post-submission improvements or corrections. The Version of Record of this contribution is published in Lecture Notes in Computer Science, and is available online at https://doi.org/10.1007/978-3-032-05461-6_9

Equation Discovery (ED) techniques have shown considerable success in regression tasks, where they are used to discover concise and interpretable models (Symbolic Regression). In this paper, we propose a new ED-based binary classification framework. Our proposed method, EDC, finds analytical functions of manageable size that specify the location and shape of the decision boundary. In extensive experiments on artificial and real-life data, we demonstrate that EDC is able to discover both the structure of the target equation and the values of its parameters, outperforming the current state-of-the-art ED-based classification methods and achieving performance comparable to the state of the art in binary classification. We suggest a grammar of modest complexity that appears to work well on the tested datasets, but argue that the exact grammar, and thus the complexity of the models, is configurable; in particular, domain-specific expressions can be included in the pattern language where required. The presented grammar consists of a series of summands (additive terms) that include linear, quadratic and exponential terms, as well as products of two features (producing hyperbolic curves ideal for capturing XOR-like dependencies). The experiments demonstrate that this grammar allows fairly flexible decision boundaries while not being so rich as to cause overfitting.

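
To make the grammar described in the abstract concrete, the following is a minimal sketch of what a model built from such summands might look like. It is not the authors' implementation: the helper names (`linear`, `quadratic`, `exponential`, `product`, `decision_value`, `classify`), the explicit bias term, and the convention of predicting class 1 when the function value is positive are all assumptions made for illustration. The search over structures and the fitting of parameters, which are the core of EDC, are not shown.

```python
import math
from typing import Callable, List, Sequence, Tuple

# A summand is a (weight, basis function) pair. The four basis-function
# builders below are hypothetical stand-ins for the term types named in the
# abstract: linear, quadratic, exponential, and product of two features.
Basis = Callable[[Sequence[float]], float]
Term = Tuple[float, Basis]


def linear(i: int) -> Basis:
    return lambda x: x[i]


def quadratic(i: int) -> Basis:
    return lambda x: x[i] ** 2


def exponential(i: int) -> Basis:
    return lambda x: math.exp(x[i])


def product(i: int, j: int) -> Basis:
    return lambda x: x[i] * x[j]


def decision_value(terms: List[Term], bias: float, x: Sequence[float]) -> float:
    """Evaluate f(x) = bias + sum of weighted summands."""
    return bias + sum(w * f(x) for w, f in terms)


def classify(terms: List[Term], bias: float, x: Sequence[float]) -> int:
    """Assumed convention: class 1 if f(x) > 0, else class 0."""
    return 1 if decision_value(terms, bias, x) > 0 else 0


# XOR-like data: the label is 1 when the two features have opposite signs.
# A single product term, f(x) = -x0 * x1, separates the classes, which no
# purely linear function of x0 and x1 can do.
xor_terms: List[Term] = [(-1.0, product(0, 1))]
points = [(1.0, 1.0), (-1.0, -1.0), (1.0, -1.0), (-1.0, 1.0)]
labels = [classify(xor_terms, 0.0, p) for p in points]

# A model mixing several term types: f(x) = x0 + 0.5 * x1^2 - exp(x0).
mixed_terms: List[Term] = [
    (1.0, linear(0)),
    (0.5, quadratic(1)),
    (-1.0, exponential(0)),
]
mixed_value = decision_value(mixed_terms, 0.0, (0.0, 2.0))
```

In this sketch the decision boundary is the set of points where `decision_value` equals zero. For the product term alone that set is the pair of coordinate axes, and adding a nonzero bias turns it into a hyperbola, which is the kind of curve the abstract associates with XOR-like dependencies.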