In this paper, we revisit pattern mining, and itemset mining in particular, which makes it possible to analyze binary datasets in search of interesting and meaningful association rules and their itemsets in an unsupervised way. Since a summarization of a dataset based on a set of patterns does not provide a general and satisfying view of the dataset, we introduce a concise representation, the closure structure, based on closed itemsets and their minimum generators, for capturing the intrinsic content of a dataset. The closure structure allows one to understand the topology of the dataset as a whole and the inherent complexity of the data. We propose a formalization of the closure structure in terms of Formal Concept Analysis, which is well suited to studying this data topology. We present and demonstrate theoretical results, as well as practical results obtained with the GDPM algorithm. GDPM is rather unique in its functionality, as it returns a characterization of the topology of a dataset in terms of complexity levels, highlighting the diversity and the distribution of the itemsets. Finally, a series of experiments shows how GDPM can be used in practice and what can be expected from its output.
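To make the notions of closed itemset and minimum generator concrete, the following is a minimal brute-force sketch on a toy binary dataset. It is not the GDPM algorithm. The dataset and the names `closure` and `closure_levels` are hypothetical. It also assumes that the "complexity level" of a closed itemset can be read as the size of its smallest generator, which is an interpretation and not something stated above.

```python
from itertools import combinations

# Hypothetical toy binary dataset: each transaction is a set of items.
DATASET = [
    frozenset("abc"),
    frozenset("ab"),
    frozenset("ac"),
    frozenset("bcd"),
]
ITEMS = sorted(set().union(*DATASET))


def closure(itemset, dataset=DATASET, items=ITEMS):
    """Return the largest itemset contained in every transaction that contains `itemset`."""
    covering = [t for t in dataset if itemset <= t]
    if not covering:
        # No transaction contains the itemset: by convention its closure is the full item set.
        return frozenset(items)
    return frozenset.intersection(*covering)


def closure_levels(dataset=DATASET, items=ITEMS):
    """Map each closed itemset to the size of its smallest generator (assumed 'level')."""
    levels = {}
    # Candidates are enumerated by increasing size, so the first candidate
    # whose closure equals a given closed itemset is a minimum generator of it.
    for k in range(len(items) + 1):
        for candidate in combinations(items, k):
            closed = closure(frozenset(candidate), dataset, items)
            levels.setdefault(closed, k)
    return levels


if __name__ == "__main__":
    for closed, level in sorted(closure_levels().items(), key=lambda kv: (kv[1], sorted(kv[0]))):
        print(level, "".join(sorted(closed)) or "(empty)")
```

In this toy example the single item `d` already generates the closed itemset `bcd`, whereas `abc` is only generated by itself. The exhaustive enumeration is exponential in the number of items and is meant purely as an illustration.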