Selecting Robust Features for Machine Learning Applications using Multidata Causal Discovery

Robust feature selection is vital for creating reliable and interpretable Machine Learning (ML) models. When designing statistical prediction models in cases where domain knowledge is limited and the underlying interactions are unknown, choosing the optimal set of features is often difficult. To mitigate this issue, we introduce a Multidata (M) causal feature selection approach that simultaneously processes an ensemble of time series datasets and produces a single set of causal drivers. This approach uses the causal discovery algorithms PC1 or PCMCI, which are implemented in the Tigramite Python package. These algorithms use conditional independence tests to infer parts of the causal graph. Our causal feature selection approach filters out causally spurious links before passing the remaining causal features as inputs to ML models (multiple linear regression, random forest) that predict the targets. We apply our framework to the statistical intensity prediction of Western Pacific Tropical Cyclones (TC), for which it is often difficult to accurately choose drivers and their dimensionality reduction (time lags, vertical levels, and area-averaging). Using more stringent significance thresholds in the conditional independence tests helps eliminate spurious causal relationships, thus helping the ML model generalize better to unseen TC cases. M-PC1 with a reduced number of features outperforms M-PCMCI, non-causal ML, and other feature selection methods (lagged correlation, random), and even slightly outperforms feature selection based on eXplainable Artificial Intelligence. The optimal causal drivers obtained from our causal feature selection help improve our understanding of underlying relationships and suggest new potential drivers of TC intensification.
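The core idea of filtering spurious links with conditional independence tests over pooled datasets can be illustrated with a minimal, pure-Python sketch. It is not the Tigramite API and not the authors' implementation: the toy data generator, the variable names (`driver`, `proxy`, `target`), the single-variable conditioning set, and the threshold `ALPHA` are all assumptions made for illustration. The sketch pools lagged samples from an ensemble of synthetic time series, then keeps a candidate feature only if its partial correlation with the target, conditioned on the other candidate, is significant under a Fisher z-test.

```python
import math
import random


def pearson(x, y):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_corr(x, y, z):
    """Correlation of x and y after removing the linear effect of z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))


def fisher_p_value(r, n_samples, n_cond):
    """Two-sided p-value for a (partial) correlation via the Fisher z-transform."""
    z = math.atanh(r) * math.sqrt(n_samples - n_cond - 3)
    return math.erfc(abs(z) / math.sqrt(2))


def simulate_dataset(n, rng):
    """Toy series (assumed): driver causes target at lag 1; proxy only mirrors driver."""
    driver = [rng.gauss(0, 1) for _ in range(n)]
    proxy = [d + rng.gauss(0, 1) for d in driver]  # no direct effect on target
    target = [0.0] + [0.8 * driver[t - 1] + rng.gauss(0, 0.5) for t in range(1, n)]
    return {"driver": driver, "proxy": proxy, "target": target}


def pool_lagged(datasets, lag=1):
    """Pool (feature[t - lag], target[t]) samples across every dataset in the ensemble."""
    drv, prx, tgt = [], [], []
    for ds in datasets:
        drv += ds["driver"][:-lag]
        prx += ds["proxy"][:-lag]
        tgt += ds["target"][lag:]
    return drv, prx, tgt


ALPHA = 0.001  # stringent significance threshold (illustrative value)
rng = random.Random(0)
ensemble = [simulate_dataset(200, rng) for _ in range(20)]
drv, prx, tgt = pool_lagged(ensemble, lag=1)

candidates = {"driver": drv, "proxy": prx}
results, selected = {}, []
for name, series in candidates.items():
    # Condition on the other candidate feature (a one-variable conditioning set).
    other = next(v for k, v in candidates.items() if k != name)
    r = partial_corr(series, tgt, other)
    p = fisher_p_value(r, len(tgt), n_cond=1)
    results[name] = {"raw_corr": pearson(series, tgt), "partial_corr": r, "p": p}
    if p < ALPHA:
        selected.append(name)

for name, res in results.items():
    print(name, {k: round(v, 4) for k, v in res.items()})
print("selected features:", selected)
```

In this toy setup, `proxy` is clearly correlated with the target at lag 1, so a lagged-correlation filter would keep it. Its partial correlation given `driver` is close to zero, so the conditional independence test tends to reject it. PC1 and PCMCI in Tigramite generalize this idea to many variables, multiple lags, and iteratively built conditioning sets.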

