A common problem in machine learning is determining whether a variable contributes significantly to a model's predictive performance. The problem is harder for datasets, such as gene expression datasets, that suffer the worst case of dimensionality: few observations combined with many candidate explanatory variables. In such settings, traditional methods for testing the statistical significance of a variable or constructing a confidence interval for it do not apply. To address these problems, we developed a novel permutation framework for testing the significance of variables in supervised models. The framework has three main advantages. First, it is non-parametric and relies on neither distributional assumptions nor asymptotic results. Second, it not only ranks model variables by relative importance but also tests the statistical significance of each variable. Third, it can test the significance of interactions between model variables. We applied the framework to multi-class classification of the Iris flower dataset and of brain regions in RNA expression data, and used it to show variable-level statistical significance and interactions.
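The general idea of a permutation test for a single variable can be sketched as follows. This is a minimal illustration under stated assumptions, not the framework described above: the nearest-centroid classifier, the use of training accuracy as the performance score, the synthetic two-feature dataset, and all function names (`nearest_centroid_accuracy`, `permutation_p_value`, `make_toy_data`) are hypothetical choices made for the example. It covers only the single-variable test, not the interaction test.

```python
import random


def nearest_centroid_accuracy(X, y):
    """Fit a nearest-centroid classifier on (X, y) and return its training accuracy."""
    classes = sorted(set(y))
    centroids = {}
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        centroids[c] = [sum(col) / len(rows) for col in zip(*rows)]
    correct = 0
    for x, label in zip(X, y):
        # Predict the class whose centroid is closest in squared Euclidean distance.
        pred = min(
            classes,
            key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])),
        )
        correct += pred == label
    return correct / len(y)


def permutation_p_value(X, y, feature, n_permutations=200, seed=0):
    """Estimate a p-value for one feature by shuffling its column.

    Shuffling breaks any association between the feature and the labels,
    so the refitted scores approximate the null distribution of the score.
    """
    rng = random.Random(seed)
    observed = nearest_centroid_accuracy(X, y)
    column = [row[feature] for row in X]
    at_least_as_good = 0
    for _ in range(n_permutations):
        shuffled = column[:]
        rng.shuffle(shuffled)
        X_perm = [
            row[:feature] + [value] + row[feature + 1:]
            for row, value in zip(X, shuffled)
        ]
        if nearest_centroid_accuracy(X_perm, y) >= observed:
            at_least_as_good += 1
    # The +1 terms count the observed data as one permutation,
    # which keeps the estimate strictly above zero.
    return (1 + at_least_as_good) / (1 + n_permutations)


def make_toy_data(n_per_class=20, seed=1):
    """Two classes; feature 0 is informative, feature 1 is pure noise."""
    rng = random.Random(seed)
    X, y = [], []
    for label, mean in ((0, 0.0), (1, 3.0)):
        for _ in range(n_per_class):
            X.append([rng.gauss(mean, 1.0), rng.gauss(0.0, 1.0)])
            y.append(label)
    return X, y


X, y = make_toy_data()
for j in range(2):
    print(f"feature {j}: p = {permutation_p_value(X, y, j):.3f}")
```

In this sketch the model is refitted on every permuted dataset, which makes no distributional assumption about the score but costs one fit per permutation. The smallest attainable p-value is 1 / (1 + `n_permutations`), so the number of permutations bounds the resolution of the test.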