Robust feature selection is vital for creating reliable and interpretable machine learning (ML) models. When designing statistical prediction models in cases where domain knowledge is limited and the underlying interactions are unknown, choosing the optimal set of features is often difficult. To mitigate this issue, we introduce a Multidata (M) causal feature selection approach that simultaneously processes an ensemble of time series datasets and produces a single set of causal drivers. This approach uses the causal discovery algorithms PC1 or PCMCI, both implemented in the Tigramite Python package. These algorithms use conditional independence tests to infer parts of the causal graph. Our causal feature selection approach filters out causally spurious links and passes the remaining causal features as inputs to the ML models (multiple linear regression, random forest) that predict the targets. We apply our framework to the statistical intensity prediction of Western Pacific tropical cyclones (TCs), for which it is often difficult to choose the drivers accurately and to decide how to reduce their dimensionality (time lags, vertical levels, and area averaging). Using more stringent significance thresholds in the conditional independence tests helps eliminate spurious causal relationships, which in turn helps the ML model generalize better to unseen TC cases. M-PC1 with a reduced number of features outperforms M-PCMCI, non-causal ML, and other feature selection methods (lagged correlation, random), and even slightly outperforms feature selection based on eXplainable Artificial Intelligence. The optimal causal drivers obtained from our causal feature selection help improve our understanding of the underlying relationships and suggest new potential drivers of TC intensification.
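The filtering step described above can be illustrated with a minimal, standard-library sketch. It is not the Tigramite implementation of PC1 or PCMCI: it assumes that samples from all datasets are pooled, uses a single fixed lag, and conditions on only one other candidate at a time with a partial-correlation test. All names (`pearson`, `partial_corr`, `p_value`, `select_features`, `make_case`) and the synthetic variables `driver`, `proxy`, and `target` are hypothetical and chosen for illustration only.

```python
import math
import random
from statistics import NormalDist


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_corr(x, y, z):
    """First-order partial correlation of x and y given z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


def p_value(r, n, n_cond):
    """Two-sided p-value via the Fisher z-transform (normal approximation)."""
    r = max(min(r, 0.999999), -0.999999)  # keep atanh finite
    z = math.atanh(r) * math.sqrt(n - n_cond - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))


def select_features(datasets, target, alpha=0.01, lag=1):
    """Keep candidates whose lagged link to the target survives every test."""
    names = [k for k in datasets[0] if k != target]
    pooled = {k: [] for k in names}
    y = []
    for d in datasets:  # pool lagged samples across the ensemble
        y.extend(d[target][lag:])
        for k in names:
            pooled[k].extend(d[k][:-lag])
    n = len(y)
    selected = []
    for k in names:
        # Marginal test first, then condition on each other candidate in turn.
        p_vals = [p_value(pearson(pooled[k], y), n, 0)]
        for c in names:
            if c != k:
                p_vals.append(p_value(partial_corr(pooled[k], y, pooled[c]), n, 1))
        if max(p_vals) <= alpha:  # a stricter alpha removes more links
            selected.append(k)
    return selected


def make_case(rng, length=400):
    """Synthetic case: 'driver' causes 'target'; 'proxy' only tracks 'driver'."""
    driver = [rng.gauss(0, 1) for _ in range(length)]
    proxy = [d + rng.gauss(0, 1) for d in driver]
    target = [0.0] + [0.8 * driver[t - 1] + rng.gauss(0, 0.5)
                      for t in range(1, length)]
    return {"driver": driver, "proxy": proxy, "target": target}


rng = random.Random(0)
cases = [make_case(rng) for _ in range(3)]  # small ensemble of time series
selected = select_features(cases, target="target", alpha=0.01)
print(selected)
```

In this toy setup `proxy` is correlated with the target only through `driver`, so its link is expected to fail the conditional test in most random draws, while `driver` is retained. The surviving features would then be passed to a regression model, as in the framework described above.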