Mixtures of von Mises-Fisher distributions can be used to cluster data on the unit hypersphere. They are particularly well suited to high-dimensional directional data such as text. In this article, we propose to estimate a von Mises-Fisher mixture using an ℓ1-penalized likelihood. This leads to sparse prototypes that improve the interpretability of the clustering. We introduce an expectation-maximisation (EM) algorithm for this estimation and explore the trade-off between the sparsity term and the likelihood term with a path-following algorithm. The model's behaviour is studied on simulated data, and we show the advantages of the approach on real-data benchmarks. We also introduce a new data set of financial reports and exhibit the benefits of our method for exploratory analysis.
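To make the link between the ℓ1 penalty and sparse prototypes concrete, the following is a minimal sketch of how a single prototype could be updated inside an M-step. It is not taken from the article: it assumes the concentration `kappa` is held fixed, that the update maximises `kappa * <r, mu> - lam * ||mu||_1` over unit vectors `mu`, and that `r` is the responsibility-weighted sum of the observations assigned to the component. Under those assumptions the maximiser is the normalised soft-thresholding of `kappa * r`. The names `soft_threshold`, `sparse_direction`, `kappa` and `lam` are hypothetical.

```python
import math


def soft_threshold(v, t):
    """Shrink each coordinate of v toward zero by t (the l1 proximal operator)."""
    return [math.copysign(max(abs(x) - t, 0.0), x) for x in v]


def sparse_direction(weighted_sum, kappa, lam):
    """Hypothetical M-step update for one prototype (assumed form, not the article's).

    Maximises kappa * <r, mu> - lam * ||mu||_1 over unit vectors mu,
    where r = weighted_sum is the responsibility-weighted sum of the data.
    """
    shrunk = soft_threshold([kappa * x for x in weighted_sum], lam)
    norm = math.sqrt(sum(x * x for x in shrunk))
    if norm == 0.0:
        # Degenerate case (lam >= max |kappa * r_j|) is not handled in this sketch.
        raise ValueError("lam is too large: every coordinate was shrunk to zero")
    return [x / norm for x in shrunk]


# Toy example: the small middle coordinate is set exactly to zero.
r = [3.0, 0.2, -4.0]
mu = sparse_direction(r, kappa=1.0, lam=1.0)
```

Increasing `lam` zeroes out more coordinates of the prototype, which is the kind of sparsity versus likelihood trade-off that a path-following scheme would trace over a range of penalty values.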