This paper proposes FREEtree, a tree-based method for high-dimensional longitudinal data with correlated features. Popular machine learning approaches commonly used for variable selection, such as Random Forests, perform poorly when features are correlated and do not account for data observed over time. FREEtree handles longitudinal data by using a piecewise random effects model. It also exploits the network structure of the features by first clustering them using weighted correlation network analysis (WGCNA). It then conducts a screening step within each cluster of features and a selection step among the surviving features, which provides a relatively unbiased way to select features. By using dominant principal components as regression variables at each leaf and the original features as splitting variables at splitting nodes, FREEtree maintains its interpretability and improves its computational efficiency. Simulation results show that FREEtree outperforms other tree-based methods in prediction accuracy, feature selection accuracy, and the ability to recover the underlying structure.
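The cluster, screen, and select sequence described above can be illustrated with a small sketch. This is not the FREEtree implementation: it assumes a hard correlation threshold as a stand-in for WGCNA, uses absolute marginal correlation with the response as a stand-in for tree-based importance, and ignores the longitudinal random effects entirely. All function names, parameters, and the simulated data are hypothetical and chosen only for illustration.

```python
import math
import random


def pearson(u, v):
    """Pearson correlation between two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)


def cluster_features(columns, threshold=0.5):
    """Group features linked by |correlation| > threshold (connected components).

    Assumption: a simplified stand-in for WGCNA, which builds a weighted
    network rather than applying a hard cut.
    """
    p = len(columns)
    labels = [None] * p
    clusters = []
    for start in range(p):
        if labels[start] is not None:
            continue
        labels[start] = len(clusters)
        members, stack = [], [start]
        while stack:
            j = stack.pop()
            members.append(j)
            for k in range(p):
                linked = abs(pearson(columns[j], columns[k])) > threshold
                if labels[k] is None and linked:
                    labels[k] = labels[start]
                    stack.append(k)
        clusters.append(sorted(members))
    return clusters


def screen_then_select(columns, y, clusters, keep_per_cluster=1, min_score=0.3):
    """Screen within each cluster, then select among the survivors.

    Assumption: |correlation with y| replaces the importance measure a
    tree-based method would use.
    """
    score = [abs(pearson(col, y)) for col in columns]
    survivors = []
    for members in clusters:
        # Screening: keep only the strongest features inside each cluster.
        ranked = sorted(members, key=lambda j: score[j], reverse=True)
        survivors.extend(ranked[:keep_per_cluster])
    # Selection: survivors from all clusters compete against a common cutoff.
    return sorted(j for j in survivors if score[j] >= min_score)


# Simulated data: two groups of correlated features; only feature 0 drives y.
rng = random.Random(0)
n = 200
latent_a = [rng.gauss(0, 1) for _ in range(n)]
latent_b = [rng.gauss(0, 1) for _ in range(n)]
columns = [[z + rng.gauss(0, 0.3) for z in latent_a] for _ in range(3)]
columns += [[z + rng.gauss(0, 0.3) for z in latent_b] for _ in range(3)]
y = [2.0 * x + rng.gauss(0, 0.5) for x in columns[0]]

clusters = cluster_features(columns)
selected = screen_then_select(columns, y, clusters)
print("clusters:", clusters)
print("selected:", selected)
```

Screening inside each cluster first means that strongly correlated features compete only with one another, so a large cluster of redundant features cannot crowd out a relevant feature from a smaller cluster in the final selection step.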