Current limitations in boosted-tree modelling prevent effective scaling to datasets with a large number of features, particularly when investigating the magnitude and directionality of each feature's effect on classification. We present a novel methodology, Hollow-tree Super (HOTS), to resolve and visualize feature importance in boosted-tree models involving a large number of features. HOTS also allows investigation of the directionality and magnitude of the effect that individual features have on classification. Using the Iris dataset, we first compare HOTS with Gini importance, partial dependence plots, and permutation importance, and demonstrate how HOTS addresses the weaknesses of these methods. We then show how HOTS can be applied to high-dimensional neuroscientific data: using data from 60 subjects with schizophrenia, we apply the method to determine which brain regions were most important for classification of schizophrenia as determined by the PANSS. On the Iris dataset, HOTS replicated and supported the findings of Gini importance, partial dependence plots, and permutation importance. When applied to the schizophrenia brain dataset, HOTS resolved the 10 most important features for classification, together with their directionality and their magnitude relative to other features. Cross-validation indicated that these same 10 features were used consistently in the decision-making process across multiple trees. These features were localised primarily to the occipital and parietal cortices, brain regions commonly disturbed in people with schizophrenia. Methodologies are needed that can handle the demands of large datasets containing many features. HOTS offers a way to investigate both the directionality and magnitude of feature importance when working at scale with boosted-tree models.
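For orientation, the three baseline methods named above can be sketched on the Iris dataset as follows. This is a minimal illustration that assumes scikit-learn and its `GradientBoostingClassifier`; it is not the HOTS method, and the model, train/test split, and hyperparameters are assumptions rather than the settings used in the study. Note that scikit-learn's `feature_importances_` for gradient boosting is an impurity-based score, used here as a stand-in for Gini importance.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import partial_dependence, permutation_importance
from sklearn.model_selection import train_test_split

# Assumed setup: a 70/30 stratified split and default boosting hyperparameters.
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, stratify=iris.target, random_state=0
)
model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# 1) Impurity-based importance: magnitude only, no direction, normalised to sum to 1.
impurity_importance = model.feature_importances_

# 2) Permutation importance: mean drop in held-out accuracy when a feature is shuffled.
perm = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)

# 3) Partial dependence: average predicted class probability as one feature varies,
#    which shows direction but only for one feature at a time.
petal_length = 2  # column index of "petal length (cm)" in iris.data
pd_result = partial_dependence(
    model, X_test, features=[petal_length], method="brute", kind="average"
)
pd_average = pd_result["average"]  # shape: (n_classes, n_grid_points)

for name, imp, p in zip(iris.feature_names, impurity_importance, perm.importances_mean):
    print(f"{name:20s} impurity={imp:.3f} permutation={p:.3f}")
print("partial dependence array shape:", pd_average.shape)
```

The first two scores rank features by magnitude without indicating direction, while partial dependence shows direction but must be inspected feature by feature, which becomes impractical as the number of features grows.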