Adapting machine learning algorithms to better handle the presence of natural clustering or batch effects within training datasets is imperative across a wide variety of biological applications. This article considers the effect of ensembling Random Forest learners trained on clusters within a single dataset with heterogeneity in the distribution of the features. We find that constructing ensembles of forests trained on clusters determined by algorithms such as k-means results in significant improvements in accuracy and generalizability over the traditional Random Forest algorithm. We call our novel approach the Cross-Cluster Weighted Forest, and examine its robustness across various data-generating scenarios and outcome models. Furthermore, we explore the influence of the data-partitioning and ensemble-weighting strategies on conferring the benefits of our method over the existing paradigm. Finally, we apply our approach to cancer molecular profiling and gene expression datasets that are naturally divisible into clusters and illustrate that our approach outperforms the classic Random Forest. Code and supplementary material are available at https://github.com/m-ramchandran/cross-cluster.
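The core idea can be sketched as follows: partition the training data into clusters (e.g. with k-means), train one Random Forest per cluster, and combine the per-cluster predictions as a weighted ensemble. This is a minimal illustration only, using scikit-learn and simple uniform weights; the paper's actual method learns the ensemble weights, and all data and parameters below are synthetic placeholders.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestRegressor

# Synthetic data standing in for a heterogeneous biological dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=300)

# Step 1: partition the training set into clusters.
k = 3
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)

# Step 2: train one forest per cluster.
forests = []
for c in range(k):
    rf = RandomForestRegressor(n_estimators=50, random_state=0)
    rf.fit(X[labels == c], y[labels == c])
    forests.append(rf)

# Step 3: combine per-cluster predictions as a weighted ensemble.
# Uniform weights here for simplicity; the actual approach learns weights.
X_new = rng.normal(size=(10, 5))
per_cluster_preds = np.column_stack([rf.predict(X_new) for rf in forests])
weights = np.full(k, 1.0 / k)
ensemble_pred = per_cluster_preds @ weights
```

In practice the weighting step is where the method departs from simple averaging, and the number of clusters k is itself a tuning choice.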