With the rapid development of the Internet, the era of big data has arrived, and huge amounts of data are generated every day. Extracting meaningful information from such massive data, however, is like looking for a needle in a haystack. Data mining offers a range of feasible methods for this problem. In particular, many sequential rule mining (SRM) algorithms have been proposed to find sequential rules in databases with sequential characteristics, and these rules help people extract meaningful information from massive amounts of data. A question that follows is how to compress the mined results and reduce their size, saving storage space and transmission time. So far, there has been little research on compression in SRM. In this paper, we combine the Minimum Description Length (MDL) principle with two metrics, support and confidence, to introduce the problem of compressing sequential rules, and we propose a solution named ComSR for MDL-based compression of sequential rules, built on a purpose-designed sequential rule coding scheme. To our knowledge, we are the first to use sequential rules to encode an entire database. A heuristic method is proposed to find a set of sequential rules that is as compact and meaningful as possible. ComSR has two trade-off algorithms, ComSR_non and ComSR_ful, which differ in whether the database can be completely compressed. Experiments on a real dataset with different thresholds show that a set of compact and meaningful sequential rules can be found, indicating that the proposed method works.
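To make the two metrics concrete, the following minimal sketch computes support and confidence for a single sequential rule X -> Y. It assumes the definitions commonly used in the SRM literature: a rule occurs in a sequence if all items of X appear before all items of Y, support is the fraction of sequences in which the rule occurs, and confidence is that count divided by the number of sequences containing X. The abstract does not spell out these definitions, so the function names and the toy database here are hypothetical and are not taken from ComSR.

```python
from typing import List, Set, Tuple

# Assumption: a sequence is an ordered list of itemsets.
Sequence = List[Set[str]]


def contains_rule(seq: Sequence, x: Set[str], y: Set[str]) -> bool:
    """Return True if some split point puts all of x before all of y."""
    for k in range(1, len(seq)):
        before = set().union(*seq[:k])  # items seen in the first k itemsets
        after = set().union(*seq[k:])   # items seen in the remaining itemsets
        if x <= before and y <= after:
            return True
    return False


def rule_metrics(db: List[Sequence], x: Set[str], y: Set[str]) -> Tuple[float, float]:
    """Return (support, confidence) of the rule x -> y over db."""
    rule_count = sum(contains_rule(s, x, y) for s in db)
    # Sequences that contain every item of x anywhere.
    x_count = sum(x <= set().union(*s) for s in db if s)
    support = rule_count / len(db) if db else 0.0
    confidence = rule_count / x_count if x_count else 0.0
    return support, confidence


# Hypothetical toy database with four sequences.
toy_db = [
    [{"a"}, {"b"}, {"c"}],
    [{"a"}, {"c"}],
    [{"c"}, {"a"}],
    [{"b"}, {"c"}],
]
support, confidence = rule_metrics(toy_db, {"a"}, {"c"})
```

In this toy database the rule {a} -> {c} occurs in two of the four sequences, while three sequences contain a, so thresholds on support and confidence would be compared against 2/4 and 2/3 respectively.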