Adapting machine learning algorithms to better handle clustering or batch effects within training data sets is important across a wide variety of biological applications. This article considers the effect of ensembling Random Forest learners trained on clusters within a single data set with heterogeneity in the distribution of the features. We find that constructing ensembles of forests trained on clusters determined by algorithms such as k-means results in significant improvements in accuracy and generalizability over the traditional Random Forest algorithm. We denote our novel approach as the Cross-Cluster Weighted Forest, and examine its robustness to various data-generating scenarios and outcome models. Furthermore, we explore the influence of the data-partitioning and ensemble weighting strategies on the benefits of our method over the existing paradigm. Finally, we apply our approach to cancer molecular profiling and gene expression data sets that are naturally divisible into clusters and illustrate that our approach outperforms the classic Random Forest. Code and supplementary material are available at https://github.com/m-ramchandran/cross-cluster.
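The core idea described above can be sketched in a few lines: partition the training data with k-means, fit one Random Forest per cluster, and combine the per-cluster forests with ensemble weights. The sketch below uses uniform weights as a placeholder; the paper's actual cross-cluster weighting strategy, hyperparameters, and preprocessing may differ, and all variable names here are illustrative.

```python
# Minimal sketch of the cluster-then-ensemble idea, assuming a regression
# setting with synthetic data. Uniform weights stand in for the paper's
# cross-cluster weighting scheme.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=600, n_features=10, random_state=0)

# Step 1: partition the training set into k clusters on the features.
k = 3
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)

# Step 2: fit one Random Forest per cluster.
forests = []
for c in range(k):
    rf = RandomForestRegressor(n_estimators=100, random_state=0)
    rf.fit(X[labels == c], y[labels == c])
    forests.append(rf)

# Step 3: combine per-cluster forests with ensemble weights
# (uniform here; the paper studies alternative weighting strategies).
weights = np.full(k, 1.0 / k)

def ensemble_predict(X_new):
    preds = np.stack([f.predict(X_new) for f in forests])
    return weights @ preds

print(ensemble_predict(X[:5]).shape)  # (5,)
```

A key design point, reflected in step 3, is that every forest predicts on every new observation regardless of cluster membership; the weighting, not hard cluster assignment, determines each forest's influence.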