Identifying differences between groups is one of the most important knowledge discovery problems. The procedure, also known as contrast set mining, is applied in a wide range of areas such as medicine, industry, and economics. In this paper we present RuleKit-CS, an algorithm for contrast set mining based on separate and conquer, a well-established heuristic for decision rule induction. Multiple passes combined with an attribute penalization scheme yield contrast sets that describe the same examples with different attributes, which distinguishes the presented approach from standard separate and conquer. The algorithm has also been generalized to regression and survival data, allowing identification of contrast sets whose label attribute or survival prognosis is consistent with the label or prognosis of the predefined contrast groups. This feature, not provided by existing approaches, further extends the usability of RuleKit-CS. Experiments on over 130 data sets from various areas and a detailed analysis of selected cases confirmed RuleKit-CS to be a useful tool for discovering differences between defined groups. The algorithm was implemented as part of the RuleKit suite, available on GitHub under the GNU AGPL 3 licence (https://github.com/adaa-polsl/RuleKit).

Keywords: contrast sets, separate and conquer, regression, survival
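
To make the idea of multiple separate-and-conquer passes with attribute penalization concrete, the following is a minimal Python sketch on nominal toy data. It is not the RuleKit-CS implementation: the function names, the precision-based quality measure, the linear penalty term, and the toy data set are all assumptions chosen for illustration only.

```python
from collections import Counter

# Hypothetical toy data: each row is a dict of nominal attributes.
ROWS = [
    {"smoker": "yes", "bmi": "high", "sex": "m"},
    {"smoker": "yes", "bmi": "high", "sex": "f"},
    {"smoker": "no", "bmi": "low", "sex": "m"},
    {"smoker": "no", "bmi": "low", "sex": "f"},
]
LABELS = ["sick", "sick", "healthy", "healthy"]


def covers(rule, row):
    """A rule is a list of (attribute, value) conditions joined by AND."""
    return all(row.get(attr) == value for attr, value in rule)


def grow_rule(rows, labels, target, uncovered, usage, penalty):
    """Greedily add conditions while the penalized score strictly improves."""
    rule, best_key = [], (float("-inf"), 0)
    while True:
        used = {attr for attr, _ in rule}
        # Candidate conditions come from still-uncovered target examples.
        candidates = {
            (attr, value)
            for i in uncovered if covers(rule, rows[i])
            for attr, value in rows[i].items() if attr not in used
        }
        best = None
        for cond in sorted(candidates):
            trial = rule + [cond]
            pos = sum(1 for i in uncovered if covers(trial, rows[i]))
            neg = sum(1 for r, y in zip(rows, labels)
                      if y != target and covers(trial, r))
            # Assumed quality: precision minus a penalty for reused attributes.
            score = pos / (pos + neg) - penalty * sum(usage[a] for a, _ in trial)
            if (score, pos) > best_key:
                best, best_key = cond, (score, pos)
        if best is None:
            return rule
        rule.append(best)


def separate_and_conquer(rows, labels, target, usage, penalty):
    """One covering pass: induce rules until all target examples are covered."""
    uncovered = {i for i, y in enumerate(labels) if y == target}
    rules = []
    while uncovered:
        rule = grow_rule(rows, labels, target, uncovered, usage, penalty)
        if not rule:
            break
        rules.append(rule)
        # "Separate": drop the target examples the new rule covers.
        uncovered -= {i for i in uncovered if covers(rule, rows[i])}
    return rules


def mine_contrast_sets(rows, labels, target, passes=2, penalty=0.5):
    """Repeat the covering pass, penalizing attributes used in earlier passes."""
    usage, all_rules = Counter(), []
    for _ in range(passes):
        rules = separate_and_conquer(rows, labels, target, usage, penalty)
        for rule in rules:
            if rule not in all_rules:
                all_rules.append(rule)
            usage.update(attr for attr, _ in rule)
    return all_rules
```

Calling `mine_contrast_sets(ROWS, LABELS, "sick")` on this toy data runs two passes: the second pass is steered away from the attribute chosen in the first, so the same examples end up described by a different attribute.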