The rapid and continuous growth of data has increased the need for scalable mining algorithms in unsupervised learning and knowledge discovery. In this paper, we focus on Sequential Pattern Mining (SPM), a fundamental topic in knowledge discovery that faces a well-known memory bottleneck. We examine generic dataset modeling techniques and show how they can be used to improve the time and memory usage of SPM algorithms. In particular, we develop trie-based dataset models and associated mining algorithms that can represent and effectively mine datasets orders of magnitude larger than those handled by the state of the art. Numerical results on large real-life test instances show that our algorithms are also faster and more memory efficient in practice.
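To make the idea of a trie-based dataset model concrete, the following is a minimal illustrative sketch, not the models or algorithms developed in the paper. It assumes a simplified setting in which every sequence is a list of single items, and the names `TrieNode`, `build_trie`, `count_nodes`, and `support` are hypothetical. Sequences that share a prefix share trie nodes, which is where the memory saving comes from, and the support of a pattern can be counted directly on the trie.

```python
from dataclasses import dataclass, field


@dataclass
class TrieNode:
    # Number of input sequences whose prefix reaches this node.
    count: int = 0
    # Maps an item to the child node that extends the prefix by that item.
    children: dict = field(default_factory=dict)


def build_trie(sequences):
    """Store a sequence dataset as a prefix trie (hypothetical helper)."""
    root = TrieNode()
    for seq in sequences:
        root.count += 1
        node = root
        for item in seq:
            node = node.children.setdefault(item, TrieNode())
            node.count += 1
    return root


def count_nodes(root):
    """Count the non-root nodes, i.e. the items actually stored."""
    total = 0
    stack = [root]
    while stack:
        node = stack.pop()
        total += len(node.children)
        stack.extend(node.children.values())
    return total


def support(root, pattern):
    """Count the sequences that contain `pattern` as a subsequence.

    Each trie path is matched greedily against the pattern. As soon as the
    pattern is fully matched at a node, every sequence passing through that
    node contains it, so the node's count is added and the subtree is skipped.
    """
    if not pattern:
        return root.count
    total = 0
    stack = [(root, 0)]  # (node, number of pattern items matched so far)
    while stack:
        node, matched = stack.pop()
        for item, child in node.children.items():
            m = matched + 1 if item == pattern[matched] else matched
            if m == len(pattern):
                total += child.count
            else:
                stack.append((child, m))
    return total


# Toy dataset: 10 items in total across four sequences.
sequences = [["a", "b", "c"], ["a", "b", "d"], ["a", "c"], ["b", "c"]]
trie = build_trie(sequences)
stored_items = count_nodes(trie)          # fewer nodes than raw items
pattern_support = support(trie, ["a", "c"])
```

In this toy dataset the prefixes `a` and `a, b` are stored once rather than once per sequence, so the trie holds fewer nodes than the raw dataset holds items. The saving on a real dataset depends on how much prefix overlap its sequences have.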