A large number of modern applications, ranging from listening to songs online and browsing the Web to using a navigation app on a smartphone, generate a plethora of user trails. Clustering such trails into groups with a common sequence pattern can reveal significant structure in human behavior, which can lead to improved user experience through better recommendations and can even help prevent suicides [LMCR14]. One approach to modeling this problem mathematically is as a mixture of Markov chains. Recently, Gupta, Kumar and Vassilvitskii [GKV16] introduced an algorithm (GKV-SVD) based on the singular value decomposition (SVD) that, under certain conditions, can perfectly recover a mixture of L chains on n states given only the distribution of trails of length 3 (3-trails). In this work we contribute to the problem of unmixing Markov chains by highlighting and addressing two important limitations of the GKV-SVD algorithm [GKV16]: first, some chains in the mixture may not even be weakly connected, and second, in practice one does not know the true number of chains beforehand. We resolve both of these issues. Specifically, we propose an algebraic criterion that enables us to efficiently choose a value of L that avoids overfitting. Furthermore, we design a reconstruction algorithm that outputs the true mixture in the presence of disconnected chains and is robust to noise. We complement our theoretical results with experiments on both synthetic and real data, where we observe that our method outperforms the GKV-SVD algorithm. Finally, we empirically observe that combining an EM algorithm with our method performs best in practice, in terms of reconstruction error with respect to both the distribution of 3-trails and the mixture of Markov chains.
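To make the input to the unmixing problem concrete, the following minimal sketch computes the 3-trail distribution of a known mixture. It assumes the usual parameterization of a mixture of Markov chains, which the abstract does not spell out: each chain l has starting probabilities s_l (summing to 1 over all chains and states) and a row-stochastic transition matrix M_l, so that p(i, j, k) = sum over l of s_l[i] * M_l[i][j] * M_l[j][k]. The function name and the toy numbers are hypothetical and serve only as an illustration.

```python
def three_trail_distribution(starts, transitions):
    """Return p[i][j][k], the probability of observing the 3-trail i -> j -> k.

    Assumed model (not taken from the abstract):
      starts[l][i]         probability of starting in state i of chain l,
                           summing to 1 over all chains and states
      transitions[l][i][j] probability that chain l moves from state i to j
    """
    n = len(starts[0])
    p = [[[0.0] * n for _ in range(n)] for _ in range(n)]
    for s, M in zip(starts, transitions):
        for i in range(n):
            for j in range(n):
                for k in range(n):
                    # Contribution of this chain to the trail i -> j -> k.
                    p[i][j][k] += s[i] * M[i][j] * M[j][k]
    return p


# Hypothetical toy mixture: L = 2 chains on n = 2 states.
starts = [[0.25, 0.25],
          [0.25, 0.25]]
transitions = [
    [[0.9, 0.1], [0.2, 0.8]],   # chain 1
    [[0.5, 0.5], [0.5, 0.5]],   # chain 2
]
p = three_trail_distribution(starts, transitions)
```

An unmixing algorithm is given only an array like `p` (or a noisy empirical estimate of it) and must recover `starts` and `transitions`, and in practice the number of chains L as well.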