Motivated by theoretical advancements in dimensionality reduction techniques, we use a recent model, the Block Markov Chain, to conduct a practical study of clustering in real-world sequential data. Clustering algorithms for Block Markov Chains possess theoretical optimality guarantees and can be deployed in sparse data regimes. Despite these favorable theoretical properties, a thorough evaluation of these algorithms in realistic settings has been lacking. We address this gap by investigating the suitability of these clustering algorithms for exploratory data analysis of real-world sequential data. In particular, our sequential data is derived from human DNA, written text, animal movement data, and financial markets. To evaluate the determined clusters and the associated Block Markov Chain model, we further develop a set of evaluation tools, which include benchmarking, spectral noise analysis, and statistical model selection tools. An efficient implementation of the clustering algorithm and the new evaluation tools is made available together with this paper. Practical challenges associated with real-world data are encountered and discussed. Ultimately, we find that the Block Markov Chain model assumption, together with the tools developed here, can indeed produce meaningful insights in exploratory data analyses despite the complexity and sparsity of real-world data.