In the statistical literature, sparse modeling is the standard approach for improving both prediction accuracy and interpretability. Alternatively, in the seminal paper "Statistical Modeling: The Two Cultures," Breiman (2001) advocated the adoption of algorithmic approaches that generate ensembles, achieving prediction accuracy superior to single-model methods at the cost of interpretability. In a recent important and critical paper, Rudin (2019) argued that black-box algorithmic approaches should be avoided for high-stakes decisions and that the tradeoff between accuracy and interpretability is a myth. In response to this recent shift in philosophy, we generalize best subset selection (BSS) to best split selection (BSpS), a data-driven approach aimed at finding the optimal split of the predictor variables among the models of an ensemble. The proposed methodology results in an ensemble of sparse and diverse models, each offering a possible mechanism explaining the relationship between the predictors and the response. The high computational cost of BSpS motivates the need for computationally tractable approximations to the exhaustive search, and we benchmark one such recent proposal by Christidis et al. (2020) based on a multi-convex relaxation. Our objective with this article is to motivate research in this exciting new field, which holds great potential for data analysis tasks involving high-dimensional data.
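To illustrate the combinatorial cost that motivates relaxations of BSpS, the following is a minimal brute-force sketch, not the authors' implementation: each predictor is assigned to one of a fixed number of models (or left unused), each model is fit by ordinary least squares, and ensemble predictions are averaged. The function name and assignment encoding are hypothetical; the point is that the search space grows as (number of models + 1)^p, which is infeasible beyond very small p.

```python
import itertools

import numpy as np


def best_split_selection(X, y, n_models=2):
    """Exhaustive best split selection (BSpS) for a tiny problem.

    Each predictor is labeled 0 (unused) or 1..n_models (its model).
    Every non-empty model is fit by least squares on its own
    predictors, and the ensemble prediction is the average across
    models. Returns the split minimizing in-sample MSE. The loop
    visits (n_models + 1) ** p candidate splits, which is why
    exhaustive search is intractable except for very small p.
    """
    n, p = X.shape
    best_mse, best_split = np.inf, None
    for split in itertools.product(range(n_models + 1), repeat=p):
        preds = []
        for g in range(1, n_models + 1):
            cols = [j for j in range(p) if split[j] == g]
            if not cols:
                continue  # this model received no predictors
            Xg = X[:, cols]
            beta, *_ = np.linalg.lstsq(Xg, y, rcond=None)
            preds.append(Xg @ beta)
        if not preds:
            continue  # all predictors unused
        y_hat = np.mean(preds, axis=0)
        mse = np.mean((y - y_hat) ** 2)
        if mse < best_mse:
            best_mse, best_split = mse, split
    return best_split, best_mse
```

For p = 20 predictors and two models this already requires 3^20 (about 3.5 billion) least-squares fits, which is the scale of difficulty that the multi-convex relaxation of Christidis et al. (2020) is designed to avoid.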