Frequent Item-set Mining (FIM), sometimes called Market Basket Analysis (MBA) or Association Rule Learning (ARL), is a family of Machine Learning (ML) methods for deriving rules from datasets of transactions of items. Most methods identify items likely to appear together in a transaction based on the support for that hypothesis (i.e. a minimum relative co-occurrence frequency of the items). Although support is a good indicator of the relevance of the assumption that these items are likely to appear together, most algorithms do not address the phenomenon of very frequent items, referred to as ubiquitous items. Ubiquitous items have the same entropy as infrequent items and do not contribute significantly to knowledge. On the other hand, they have a strong effect on the performance of the algorithms: they sometimes prevent FIM algorithms from converging and thus from providing meaningful results. This paper discusses the phenomenon of ubiquitous items and demonstrates that ignoring them has a dramatic effect on computational performance, with only a low and controlled effect on the significance of the results.
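
To make the idea concrete, the following minimal Python sketch computes the relative support of each item, shows that a very frequent and a very infrequent item carry the same entropy, and removes ubiquitous items before any mining step. It is an illustration under stated assumptions, not the method evaluated in the paper: the function names, the toy transactions, the 0.9 support cut-off, and the reading of "entropy" as the binary entropy of an item's presence indicator are all hypothetical choices made here.

```python
from collections import Counter
from math import log2


def item_support(transactions):
    """Relative support of each item: fraction of transactions containing it."""
    n = len(transactions)
    counts = Counter(item for t in transactions for item in set(t))
    return {item: c / n for item, c in counts.items()}


def binary_entropy(p):
    """Entropy in bits of a presence/absence indicator with probability p.

    Assumption: this is one plausible reading of the entropy mentioned above.
    """
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)


def drop_ubiquitous(transactions, max_support=0.9):
    """Remove items whose support exceeds max_support (hypothetical threshold)."""
    support = item_support(transactions)
    ubiquitous = {item for item, s in support.items() if s > max_support}
    filtered = [[i for i in t if i not in ubiquitous] for t in transactions]
    return filtered, ubiquitous


# Toy data: "bag" appears in every transaction and is therefore ubiquitous.
transactions = [
    ["bag", "bread", "milk"],
    ["bag", "bread"],
    ["bag", "milk", "eggs"],
    ["bag", "bread", "milk"],
]

filtered, ubiquitous = drop_ubiquitous(transactions)
print(ubiquitous)  # items removed before mining
print(filtered)    # transactions passed on to a FIM algorithm

# An item present in 95% of transactions is as uninformative as one
# present in 5%: the binary entropy is symmetric around p = 0.5.
print(binary_entropy(0.95), binary_entropy(0.05))
```

Filtering shortens every transaction that contained a ubiquitous item, which reduces the number of candidate item-sets a FIM algorithm has to enumerate; a suitable cut-off would have to be chosen for each dataset.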