The prevalence of categorical/sequential data across diverse domains underscores the importance of sequence mining. Because sequences are inherently challenging to analyze, continued research is needed to find faster and more accurate approaches that give a better understanding of their (dis)similarities. This paper proposes a new model-based approach for clustering sequence data, namely nTreeClus. The proposed method combines tree-based learners, k-mers, and autoregressive models for categorical time series, culminating in a novel numerical representation of categorical sequences. Using this representation, we cluster sequences while accounting for the inherent patterns in categorical time series. The model also proved robust to its parameter. Under different simulated scenarios, nTreeClus improved on the baseline methods by up to 10.7% and 2.7% on various internal and external cluster validation metrics, respectively. The empirical evaluation using synthetic and real datasets, protein sequences, and categorical time series showed that nTreeClus is competitive with or superior to most state-of-the-art algorithms.
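The abstract does not spell out how k-mers and the autoregressive model fit together. One plausible reading, sketched here as an assumption and not as the authors' implementation, is that each window of length n (a k-mer) is split into n - 1 lagged symbols used as predictors and a final symbol used as the target. The function name `autoregressive_rows`, the `Row` layout, and the toy sequences are hypothetical.

```python
from typing import List, Sequence, Tuple

# (sequence index, lagged predictor symbols, target symbol); hypothetical layout
Row = Tuple[int, Tuple[str, ...], str]


def autoregressive_rows(sequences: Sequence[str], n: int) -> List[Row]:
    """Slide a window of length n over each categorical sequence.

    Each window (a k-mer with k = n) becomes one row: the first n - 1
    symbols are lagged predictors and the last symbol is the target.
    Sequences shorter than n contribute no rows.
    """
    if n < 2:
        raise ValueError("n must be at least 2 (one predictor, one target)")
    rows: List[Row] = []
    for seq_id, seq in enumerate(sequences):
        for start in range(len(seq) - n + 1):
            window = seq[start:start + n]
            rows.append((seq_id, tuple(window[:-1]), window[-1]))
    return rows


# Toy input, not data from the paper.
rows = autoregressive_rows(["ABCAB", "BBA"], n=3)
for row in rows:
    print(row)
```

Under this assumption, rows of this shape could be passed to a tree-based learner, and the sequence index kept in each row would allow the learner's outputs to be aggregated per sequence into a numerical representation for clustering.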