We introduce the problem of robust subgroup discovery, i.e., finding a set of interpretable descriptions of subsets that 1) stand out with respect to one or more target attributes, 2) are statistically robust, and 3) non-redundant. Many attempts have been made to mine either locally robust subgroups or to tackle the pattern explosion, but we are the first to address both challenges at the same time from a global modelling perspective. First, we formulate the broad model class of subgroup lists, i.e., ordered sets of subgroups, for univariate and multivariate targets that can consist of nominal or numeric variables, including traditional top-1 subgroup discovery in its definition. This novel model class allows us to formalise the problem of optimal robust subgroup discovery using the Minimum Description Length (MDL) principle, where we resort to optimal Normalised Maximum Likelihood and Bayesian encodings for nominal and numeric targets, respectively. Second, finding optimal subgroup lists is NP-hard. Therefore, we propose SSD++, a greedy heuristic that finds good subgroup lists and guarantees that the most significant subgroup found according to the MDL criterion is added in each iteration. In fact, the greedy gain is shown to be equivalent to a Bayesian one-sample proportion, multinomial, or t-test between the subgroup and dataset marginal target distributions plus a multiple hypothesis testing penalty. Furthermore, we empirically show on 54 datasets that SSD++ outperforms previous subgroup discovery methods in terms of quality, generalisation on unseen data, and subgroup list size.
翻译:我们引入了强力分组发现的问题,即找到一组可解释的子集描述,即(1) 突出一个或一个以上目标属性,(2) 在统计上是稳健的,(3) 不冗余。许多尝试都是为了开采当地稳健的分组,或是为了解决模式爆炸,但我们首先从全球建模的角度同时应对这两个挑战。首先,我们为可包含名义或数字变量(包括定义中传统的上层-1分组的发现)的单子分组和多变式目标(即订购的分组)制定广泛的模型类别。这个新型模型类允许我们正式解决使用最小描述值(MDL)原则最佳稳健分组发现的问题,即我们分别为名义和数字目标分别采用最佳常态最大隐隐蔽度和巴耶斯编码。第二,找到最佳分组名单是硬的。因此,我们建议SD++,一种贪婪的基分组数据,可以保证在名义或数字级分组质量定义中发现的最重要分组,在最低定义值分组中显示一个比值的SDL的比值比值, 显示每个基值的比值比值比值比值数据。