Random Forests (RF) are at the cutting edge of supervised machine learning in terms of prediction performance, especially in genomics. Iterative Random Forests (iRF) use a tree ensemble from iteratively modified RF to obtain predictive and stable nonlinear high-order Boolean interactions of features. They have shown great promise for high-order biological interaction discovery, which is central to advancing functional genomics and precision medicine. However, theoretical studies of how tree-based methods discover high-order feature interactions are missing. In this paper, to enable such theoretical studies, we first introduce a novel discontinuous nonlinear regression model, called the Locally Spiky Sparse (LSS) model, which is inspired by the thresholding behavior in many biological processes. Specifically, the LSS model assumes that the regression function is a linear combination of piecewise constant Boolean interaction terms. We define a quantity called depth-weighted prevalence (DWP) for a set of signed features S and a given RF tree ensemble. We prove that, with high probability under the LSS model, the DWP of S attains a universal upper bound that does not involve any model coefficients if and only if S corresponds to a union of Boolean interactions in the LSS model. As a consequence, we show that RF yields consistent interaction discovery under the LSS model. Simulation results show that DWP can recover the interactions under the LSS model even when some assumptions, such as the uniformity assumption, are violated.
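To make the description of the LSS regression function more concrete, the following is a minimal Python sketch of a function that is a linear combination of piecewise constant Boolean interaction terms, with each term built from thresholded, signed features. The abstract does not give the formal definition, so everything specific here is an assumption made for illustration: the helper names `lss_regression_function` and `simulate_lss`, the encoding of a signed feature as a `(feature_index, threshold, sign)` triple, the coefficients and thresholds in `INTERACTIONS`, the Gaussian noise, and the use of independent uniform features (one reading of the uniformity assumption mentioned above).

```python
import random


def lss_regression_function(x, interactions, intercept=0.0):
    """Evaluate a linear combination of Boolean interaction terms at x.

    `interactions` is a list of (coefficient, terms) pairs. Each `terms`
    entry is a list of (feature_index, threshold, sign) triples
    (a hypothetical encoding of signed features): sign=+1 requires
    x[k] > threshold, sign=-1 requires x[k] <= threshold.
    """
    total = intercept
    for coef, terms in interactions:
        # The interaction is "on" only if every thresholded feature agrees.
        active = all(
            (x[k] > t) if sign > 0 else (x[k] <= t)
            for k, t, sign in terms
        )
        if active:
            total += coef
    return total


def simulate_lss(n, p, interactions, noise_sd=0.1, seed=0):
    """Draw n samples with p independent Uniform(0, 1) features (assumed)
    and responses equal to the regression function plus Gaussian noise."""
    rng = random.Random(seed)
    X = [[rng.random() for _ in range(p)] for _ in range(n)]
    y = [
        lss_regression_function(x, interactions) + rng.gauss(0.0, noise_sd)
        for x in X
    ]
    return X, y


# Hypothetical example: an order-2 and an order-3 interaction on
# disjoint feature sets, with made-up coefficients and thresholds.
INTERACTIONS = [
    (2.0, [(0, 0.5, +1), (1, 0.5, +1)]),
    (-1.0, [(2, 0.3, -1), (3, 0.7, +1), (4, 0.5, -1)]),
]

X, y = simulate_lss(n=1000, p=10, interactions=INTERACTIONS)
```

The resulting `X` and `y` could be passed to any random forest implementation to experiment with interaction recovery. The sketch deliberately stops short of computing DWP, since its definition is not stated in this abstract.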