Machine learning in high-stakes domains, such as healthcare, faces two critical challenges: (1) generalizing to diverse data distributions given limited training data while (2) maintaining interpretability. To address these challenges, we propose an instance-weighted tree-sum method that effectively pools data across diverse groups to output a concise, rule-based model. Given distinct groups of instances in a dataset (e.g., medical patients grouped by age or treatment site), our method first estimates group membership probabilities for each instance. Then, it uses these estimates as instance weights in FIGS (Tan et al. 2022), to grow a set of decision trees whose values sum to the final prediction. We call this new method Group Probability-Weighted Tree Sums (G-FIGS). G-FIGS achieves state-of-the-art prediction performance on important clinical datasets; e.g., holding the level of sensitivity fixed at 92%, G-FIGS increases specificity for identifying cervical spine injury by up to 10% over CART and up to 3% over FIGS alone, with larger gains at higher sensitivity levels. By keeping the total number of rules below 16 in FIGS, the final models remain interpretable, and we find that their rules match medical domain expertise. All code, data, and models are released on GitHub.
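The abstract describes a two-stage procedure: (1) estimate each instance's group-membership probability, and (2) use those probabilities as instance weights when fitting a FIGS tree-sum model for each group, so that every group's model can borrow strength from the pooled data. The sketch below illustrates this pipeline under stated assumptions; it is not the authors' released implementation. It assumes that `FIGSClassifier` from the `imodels` package accepts a `sample_weight` argument in `fit` (as scikit-learn-style estimators commonly do), and the helper names are hypothetical.

```python
# Hypothetical sketch of the G-FIGS two-stage pipeline (not the authors' released code).
# Assumes imodels.FIGSClassifier.fit accepts sample_weight, like sklearn estimators.
import numpy as np
from sklearn.linear_model import LogisticRegression
from imodels import FIGSClassifier


def fit_g_figs(X, y, groups, max_rules=16):
    """Fit one probability-weighted FIGS model per group.

    X: (n, d) feature matrix; y: (n,) binary labels;
    groups: (n,) group label per instance (e.g., age bracket or treatment site).
    """
    # Stage 1: estimate P(group = g | x) for every instance.
    membership_clf = LogisticRegression(max_iter=1000).fit(X, groups)
    group_probs = membership_clf.predict_proba(X)  # shape (n, n_groups)

    # Stage 2: for each group, fit FIGS on *all* instances,
    # weighting each instance by its probability of belonging to that group.
    models = {}
    for j, g in enumerate(membership_clf.classes_):
        figs = FIGSClassifier(max_rules=max_rules)  # cap total rules for interpretability
        figs.fit(X, y, sample_weight=group_probs[:, j])
        models[g] = figs
    return models


def predict_g_figs(models, X, groups):
    # At test time, each instance is routed to the model fit for its own group.
    preds = np.empty(len(X))
    for g, model in models.items():
        mask = groups == g
        if mask.any():
            preds[mask] = model.predict(X[mask])
    return preds
```

In this sketch the rule budget (`max_rules=16`) mirrors the cap mentioned in the abstract; the choice of logistic regression for the membership model is an illustrative assumption, and any probabilistic classifier could fill that role.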