We introduce the problem of robust subgroup discovery, i.e., finding a set of interpretable descriptions of subsets that 1) stand out with respect to one or more target attributes, 2) are statistically robust, and 3) are non-redundant. Many attempts have been made either to mine locally robust subgroups or to tackle the pattern explosion, but we are the first to address both challenges at the same time from a global modelling perspective. First, we formulate the broad model class of subgroup lists, i.e., ordered sets of subgroups, for univariate and multivariate targets that can consist of nominal or numeric variables; this class includes traditional top-1 subgroup discovery as a special case. This novel model class allows us to formalise the problem of optimal robust subgroup discovery using the Minimum Description Length (MDL) principle, where we resort to optimal Normalised Maximum Likelihood and Bayesian encodings for nominal and numeric targets, respectively. Second, as finding optimal subgroup lists is NP-hard, we propose SSD++, a greedy heuristic that finds good subgroup lists and guarantees that, in each iteration, the most significant subgroup found according to the MDL criterion is added. We show that this criterion is equivalent to a Bayesian one-sample proportions, multinomial, or t-test between the subgroup and dataset marginal target distributions, plus a multiple hypothesis testing penalty. We empirically show on 54 datasets that SSD++ outperforms previous subgroup set discovery methods in terms of quality and subgroup list size.
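To make the notion of a subgroup list more concrete, the following is a minimal sketch for a single nominal target. It is not the SSD++ implementation: the names `Subgroup`, `assign`, and `target_counts`, the toy data, and the first-match reading of "ordered" (each row belongs to the first subgroup in the list that covers it) are assumptions made for illustration. No MDL score is computed; the sketch only collects the per-subgroup target counts that would be compared against the dataset marginal.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

Row = Dict[str, object]


@dataclass
class Subgroup:
    """A hypothetical subgroup: a readable description plus a cover test."""
    description: str
    covers: Callable[[Row], bool]


def assign(rows: Sequence[Row], subgroups: Sequence[Subgroup]) -> List[int]:
    """Index of the first subgroup covering each row, or -1 if none does.

    Assumption: the order of the list resolves overlaps (first match wins).
    """
    assignment = []
    for row in rows:
        index = -1
        for i, subgroup in enumerate(subgroups):
            if subgroup.covers(row):
                index = i
                break
        assignment.append(index)
    return assignment


def target_counts(rows: Sequence[Row], targets: Sequence[str],
                  subgroups: Sequence[Subgroup]) -> Dict[int, Counter]:
    """Target value counts per subgroup; key -1 holds the dataset marginal."""
    counts: Dict[int, Counter] = {-1: Counter(targets)}
    for index, target in zip(assign(rows, subgroups), targets):
        if index >= 0:
            counts.setdefault(index, Counter())[target] += 1
    return counts


# Toy data (invented for illustration only).
rows = [
    {"age": 25, "smoker": True},
    {"age": 62, "smoker": True},
    {"age": 70, "smoker": False},
    {"age": 30, "smoker": False},
    {"age": 45, "smoker": False},
]
targets = ["yes", "yes", "yes", "no", "no"]
subgroups = [
    Subgroup("smoker = True", lambda r: bool(r["smoker"])),
    Subgroup("age >= 60", lambda r: r["age"] >= 60),
]
```

In this toy example the second row satisfies both descriptions but is assigned only to the first subgroup, which is how an ordered list avoids counting overlapping subgroups twice. A greedy procedure in the spirit of the abstract would repeatedly append the candidate subgroup that most improves a chosen score; that score is deliberately left out here.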