We introduce the problem of robust subgroup discovery, i.e., finding a set of interpretable descriptions of subsets that 1) stand out with respect to one or more target attributes, 2) are statistically robust, and 3) non-redundant. Many attempts have been made to mine either locally robust subgroups or to tackle the pattern explosion, but we are the first to address both challenges at the same time from a global perspective. First, we formulate a broad model class of subgroup lists, i.e., ordered sets of subgroups, for univariate and multivariate targets that can consist of nominal or numeric variables. This novel model class allows us to formalize the problem of optimal robust subgroup discovery using the Minimum Description Length (MDL) principle, where we resort to optimal Normalized Maximum Likelihood and Bayesian encodings for nominal and numeric targets, respectively. Notably, we show that our problem definition is equal to mining the top-1 subgroup with an information-theoretic quality measure plus a penalty for complexity. Second, as finding optimal subgroup lists is NP-hard, we propose RSD, a greedy heuristic that finds good subgroup lists and guarantees that the most significant subgroup found according to the MDL criterion is added in each iteration, which is shown to be equivalent to a Bayesian one-sample proportions, multinomial, or t-test between the subgroup and dataset marginal target distributions plus a multiple hypothesis testing penalty. We empirically show on 54 datasets that RSD outperforms previous subgroup set discovery methods in terms of quality and subgroup list size.
翻译:我们引入了强力分组发现的问题, 即, 找到一组可解释的子集描述, 1) 在一个或多个目标属性方面最为突出, 2在统计上是稳健的, 3在统计上是稳健的。 许多尝试都是为了开采当地稳健的分组, 或者是为了解决模式爆炸, 但我们是第一个同时从全球角度应对这两个挑战的。 首先, 我们设计了一个范围很广的分组分组分类模型, 即, 定购的分组, 可用于由名义变量或数字变量组成的非静态和多变量。 这个新型模型类让我们能够用最低描述长度( MDL) 原则来正式确定最佳稳健的分组发现问题。 在那里, 我们采用优化的常态最大隐隐性最小缩略和巴伊斯编码来分别处理名义和数字目标。 值得注意的是, 我们的问题定义相当于以信息- 理论质量衡量和复杂度加惩罚的方式开采上层分组。 其次, 在找到最佳的分组清单时, 我们建议 RSD, 贪婪的基质分组发现 和 等值分组, 显示每个比值分组的分组 显示一个显著的基数 标准 显示一个比值 。