The main difficulty in fitting a higher-order Markov model is that the number of parameters grows exponentially with the order. The most popular remedy is the Variable Length Markov Chain (VLMC), which identifies the relevant contexts (recent pasts) of variable order and arranges them in a context tree. A more general approach is the Sparse Markov Model (SMM), in which all possible histories of order $m$ are partitioned into groups so that histories in the same group share an identical transition probability vector. We develop an elegant method of fitting an SMM using convex clustering, which involves regularization. The regularization parameter is selected using the Bayesian information criterion (BIC). Theoretical results establish the model selection consistency of our method for large sample sizes. Extensive simulation studies under different setups are presented to assess the performance of our method. We apply the method to classify genome sequences obtained from individuals affected by different viruses.
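To make the parameter-count argument concrete, the following minimal sketch compares the number of free transition parameters of a full order-$m$ chain with that of a sparse model whose histories fall into a fixed number of groups. It also computes the empirical transition vector of each observed history, which is the kind of quantity a clustering step would group. The function names `free_parameters` and `empirical_transitions` are hypothetical and chosen for illustration only; the sketch does not implement the convex clustering or BIC selection described above.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Optional


def free_parameters(alphabet_size: int, order: int,
                    n_groups: Optional[int] = None) -> int:
    """Count free transition parameters.

    A full order-m chain has one probability vector per history, i.e.
    alphabet_size**order vectors. A sparse model with n_groups groups has
    one vector per group. Each vector has alphabet_size - 1 free entries
    because its entries sum to one.
    """
    n_vectors = alphabet_size ** order if n_groups is None else n_groups
    return n_vectors * (alphabet_size - 1)


def empirical_transitions(seq: str, order: int,
                          alphabet: str) -> Dict[str, List[float]]:
    """Estimate the transition vector of every history observed in seq."""
    counts = defaultdict(Counter)
    for i in range(order, len(seq)):
        history = seq[i - order:i]      # the preceding `order` symbols
        counts[history][seq[i]] += 1    # tally the symbol that follows
    probs = {}
    for history, counter in counts.items():
        total = sum(counter.values())
        probs[history] = [counter[a] / total for a in alphabet]
    return probs


if __name__ == "__main__":
    # DNA alphabet of size 4: the full model grows as 3 * 4**m.
    for m in range(1, 6):
        print(m, free_parameters(4, m))
    # Illustrative toy sequence, not real genome data.
    print(empirical_transitions("ACGTACGTAACC", order=2, alphabet="ACGT"))
```

Histories whose estimated vectors are close are natural candidates for sharing a group, which is what reduces the count from `alphabet_size**order` vectors to `n_groups` vectors.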