Identifying differences between groups is one of the most important knowledge discovery problems. The procedure, also known as contrast set mining, is applied in a wide range of areas such as medicine, industry, and economics. In this paper we present RuleKit-CS, an algorithm for contrast set mining based on sequential covering, a well-established heuristic for decision rule induction. Unlike standard sequential covering, RuleKit-CS makes multiple passes accompanied by an attribute penalization scheme, which allows it to generate contrast sets that describe the same examples using different attributes. The ability to identify contrast sets in regression and survival data sets, a feature not provided by existing algorithms, further extends the usability of RuleKit-CS. Experiments on a wide range of data sets confirmed RuleKit-CS to be a useful tool for discovering differences between defined groups. The algorithm is part of the RuleKit suite, available on GitHub under the GNU AGPL 3 licence (https://github.com/adaa-polsl/RuleKit).

Keywords: Contrast sets, Sequential covering, Rule induction, Regression, Survival, Knowledge discovery
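To make the idea of multiple covering passes with attribute penalization concrete, the following is a minimal illustrative sketch, not the RuleKit-CS implementation. It assumes nominal attributes only, and the quality measure (a coverage difference), the penalty formula, the end-of-pass penalty update, and all function and parameter names (`mine_contrast_sets`, `grow_rule`, `penalty_step`) are hypothetical choices made for illustration.

```python
from collections import defaultdict


def covers(conditions, row):
    """True if the row satisfies every attribute == value condition."""
    return all(row[attr] == value for attr, value in conditions.items())


def quality(conditions, rows, labels, group, uncovered):
    """Assumed measure: share of target-group examples newly covered
    minus share of examples from other groups covered."""
    n_pos = sum(1 for label in labels if label == group)
    n_neg = len(labels) - n_pos
    pos = sum(1 for i in uncovered if covers(conditions, rows[i]))
    neg = sum(1 for row, label in zip(rows, labels)
              if label != group and covers(conditions, row))
    return pos / max(n_pos, 1) - neg / max(n_neg, 1)


def grow_rule(rows, labels, group, uncovered, penalty):
    """Greedily add the condition that most improves the penalized quality."""
    conditions, best = {}, 0.0
    while True:
        candidates = sorted({(a, v) for i in uncovered
                             for a, v in rows[i].items() if a not in conditions})
        choice = None
        for attr, value in candidates:
            trial = {**conditions, attr: value}
            # Assumed penalty: scale quality by the most penalized attribute used.
            factor = 1.0 - max(penalty[a] for a in trial)
            score = quality(trial, rows, labels, group, uncovered) * factor
            if score > best:
                best, choice = score, (attr, value)
        if choice is None:
            return conditions
        conditions[choice[0]] = choice[1]


def mine_contrast_sets(rows, labels, group, passes=2, penalty_step=0.5):
    """Run several sequential-covering passes; after each pass, penalize
    the attributes that were used so later passes prefer other attributes."""
    penalty = defaultdict(float)
    result = []
    for _ in range(passes):
        uncovered = {i for i, label in enumerate(labels) if label == group}
        pass_rules = []
        while uncovered:
            rule = grow_rule(rows, labels, group, uncovered, penalty)
            if not rule:
                break
            uncovered -= {i for i in uncovered if covers(rule, rows[i])}
            pass_rules.append(rule)
            if rule not in result:
                result.append(rule)
        for rule in pass_rules:
            for attr in rule:
                penalty[attr] = min(1.0, penalty[attr] + penalty_step)
    return result


rows = [{"a": "1", "b": "p"}, {"a": "1", "b": "p"},
        {"a": "0", "b": "q"}, {"a": "0", "b": "q"}]
labels = ["x", "x", "y", "y"]
print(mine_contrast_sets(rows, labels, "x", passes=1))  # [{'a': '1'}]
print(mine_contrast_sets(rows, labels, "x", passes=2))  # [{'a': '1'}, {'b': 'p'}]
```

In this toy data both attributes separate group `x` equally well. A single pass stops once the examples are covered by `a`, while the second pass, with `a` penalized, describes the same examples through `b`.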