Identifying differences between groups is one of the most important knowledge discovery problems. The procedure, also known as contrast set mining, is applied in a wide range of areas such as medicine, industry, and economics. In this paper we present RuleKit-CS, an algorithm for contrast set mining based on sequential covering, a well-established heuristic for decision rule induction. Fusing multiple covering passes with an attribute penalization scheme allows the algorithm to generate contrast sets that describe the same examples with different attributes, which distinguishes the presented approach from standard sequential covering. The ability to identify contrast sets in regression and survival data sets, a feature not provided by existing algorithms, further extends the usability of RuleKit-CS. Experiments on over 130 data sets from various areas, together with a detailed analysis of selected cases, confirmed RuleKit-CS to be a useful tool for discovering differences between defined groups. The algorithm was implemented as part of the RuleKit suite, available on GitHub under the GNU AGPL 3 licence (https://github.com/adaa-polsl/RuleKit).

Keywords: Contrast sets, Sequential covering, Model fusion, Rule induction, Regression, Survival, Knowledge discovery
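To make the idea of multi-pass sequential covering with attribute penalization more concrete, the following is a minimal sketch on categorical toy data. It is not the RuleKit-CS implementation: the quality measure (precision times coverage), the penalty factor 1 / (1 + penalty * usage), the choice to update attribute usage only after each pass, and all function and parameter names are assumptions made for illustration.

```python
from collections import Counter


def covers(rule, row):
    """A rule is a dict attribute -> required value (a conjunction)."""
    return all(row[attr] == value for attr, value in rule.items())


def score(rule, rows, labels, target, uncovered, usage, penalty):
    """Illustrative quality: precision * coverage, damped for reused attributes."""
    p = sum(1 for i in uncovered if covers(rule, rows[i]))
    n = sum(1 for row, lab in zip(rows, labels)
            if lab != target and covers(rule, row))
    if p == 0:
        return 0.0
    quality = (p / (p + n)) * (p / len(uncovered))
    for attr in rule:
        # Assumed penalty form: attributes used in earlier passes score lower.
        quality /= 1.0 + penalty * usage[attr]
    return quality


def grow_rule(rows, labels, target, uncovered, usage, penalty):
    """Greedily add the condition that most improves the penalized score."""
    rule = {}
    best = score(rule, rows, labels, target, uncovered, usage, penalty)
    candidates = sorted({(a, v) for i in uncovered for a, v in rows[i].items()})
    while True:
        best_cond = None
        for attr, value in candidates:
            if attr in rule:
                continue
            s = score({**rule, attr: value}, rows, labels, target,
                      uncovered, usage, penalty)
            if s > best:
                best, best_cond = s, (attr, value)
        if best_cond is None:
            return rule
        rule[best_cond[0]] = best_cond[1]


def mine_contrast_sets(rows, labels, target, passes=2, penalty=1.0):
    """Run several covering passes; each pass restarts with all examples."""
    usage = Counter()
    found = []
    for _ in range(passes):
        uncovered = {i for i, lab in enumerate(labels) if lab == target}
        pass_rules = []
        while uncovered:
            rule = grow_rule(rows, labels, target, uncovered, usage, penalty)
            newly = {i for i in uncovered if covers(rule, rows[i])}
            if not rule or not newly:
                break  # no condition improves on the empty rule
            pass_rules.append(rule)
            uncovered -= newly  # standard sequential covering step
        for rule in pass_rules:
            usage.update(rule.keys())  # penalize these attributes next pass
            if rule not in found:
                found.append(rule)
    return found


# Toy data: both "color" and "shape" separate group A from group B.
rows = [
    {"color": "red", "shape": "round", "size": "small"},
    {"color": "red", "shape": "round", "size": "large"},
    {"color": "red", "shape": "round", "size": "small"},
    {"color": "blue", "shape": "square", "size": "small"},
    {"color": "blue", "shape": "square", "size": "large"},
    {"color": "blue", "shape": "square", "size": "large"},
]
labels = ["A", "A", "A", "B", "B", "B"]

result = mine_contrast_sets(rows, labels, target="A")
print(result)  # [{'color': 'red'}, {'shape': 'round'}]
```

In this toy run, a single covering pass stops after finding `color = red`, because every example of group A is already covered. The second pass penalizes `color` and therefore describes the same examples with `shape = round`, which is the kind of alternative description that a single pass of standard sequential covering would miss.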