The Wasserstein distance between mixing measures has come to occupy a central place in the statistical analysis of mixture models. This work proposes a new canonical interpretation of this distance and provides tools to perform inference on the Wasserstein distance between mixing measures in topic models. We consider the general setting of an identifiable mixture model consisting of mixtures of distributions from a set $\mathcal{A}$ equipped with an arbitrary metric $d$, and show that the Wasserstein distance between mixing measures is uniquely characterized as the most discriminative convex extension of the metric $d$ to the set of mixtures of elements of $\mathcal{A}$. The Wasserstein distance between mixing measures has been widely used in the study of such models, but without axiomatic justification. Our results establish this metric to be a canonical choice. Specializing our results to topic models, we consider estimation and inference of this distance. Though upper bounds for its estimation have been recently established elsewhere, we prove the first minimax lower bounds for the estimation of the Wasserstein distance in topic models. We also establish fully data-driven inferential tools for the Wasserstein distance in the topic model context. Our results apply to potentially sparse mixtures of high-dimensional discrete probability distributions. These results allow us to obtain the first asymptotically valid confidence intervals for the Wasserstein distance in topic models.
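To make the central object concrete, here is a minimal sketch of the Wasserstein distance between mixing measures when $\mathcal{A}$ is finite. It rests on the standard optimal-transport definition: for mixing weights $\alpha$ and $\beta$ over atoms $A_1, \dots, A_K$, the distance is the minimum of $\sum_{k,l} \pi_{kl}\, d(A_k, A_l)$ over couplings $\pi$ with marginals $\alpha$ and $\beta$. Everything specific below is an assumption made for illustration and is not taken from the paper: the three toy topics, the choice of total variation as the metric $d$, the function names `total_variation` and `wasserstein_mixing`, and the small successive-shortest-path solver, which stands in for any exact linear-programming solver. It does not implement the paper's estimators or confidence intervals.

```python
from itertools import product


def total_variation(p, q):
    """Total variation distance between two discrete distributions (assumed choice of d)."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def wasserstein_mixing(alpha, beta, cost, tol=1e-12):
    """Minimum cost of transporting weights alpha onto beta, with cost[k][l] = d(A_k, A_l).

    Solved by successive shortest paths on the bipartite residual graph;
    intended for small illustrative problems only.
    """
    K, L = len(alpha), len(beta)
    supply, demand = list(alpha), list(beta)
    flow = [[0.0] * L for _ in range(K)]
    inf = float("inf")
    while max(supply) > tol and max(demand) > tol:
        # Bellman-Ford from every source that still has mass to ship.
        # Nodes 0..K-1 are sources, nodes K..K+L-1 are sinks.
        dist = [0.0 if supply[k] > tol else inf for k in range(K)] + [inf] * L
        prev = [None] * (K + L)
        for _ in range(K + L):
            changed = False
            for k, l in product(range(K), range(L)):
                # Forward edge k -> l (unlimited capacity).
                if dist[k] + cost[k][l] < dist[K + l] - 1e-15:
                    dist[K + l] = dist[k] + cost[k][l]
                    prev[K + l] = k
                    changed = True
                # Backward edge l -> k, usable only where flow can be undone.
                if flow[k][l] > tol and dist[K + l] - cost[k][l] < dist[k] - 1e-15:
                    dist[k] = dist[K + l] - cost[k][l]
                    prev[k] = K + l
                    changed = True
            if not changed:
                break
        # Cheapest reachable sink that still needs mass.
        end = min((l for l in range(L) if demand[l] > tol), key=lambda l: dist[K + l])
        path = [K + end]
        while prev[path[-1]] is not None:
            path.append(prev[path[-1]])
        path.reverse()
        start = path[0]
        # Largest amount the path can carry.
        amount = min(supply[start], demand[end])
        for u, v in zip(path, path[1:]):
            if u >= K:
                amount = min(amount, flow[v][u - K])
        # Push the flow along the path.
        for u, v in zip(path, path[1:]):
            if u < K:
                flow[u][v - K] += amount
            else:
                flow[v][u - K] -= amount
        supply[start] -= amount
        demand[end] -= amount
    return sum(flow[k][l] * cost[k][l] for k in range(K) for l in range(L))


# Three hypothetical topics over a four-word vocabulary.
topics = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.6, 0.2, 0.1, 0.1],
]
cost = [[total_variation(a, b) for b in topics] for a in topics]

# Point masses on single topics: the distance reduces to d itself (the "extension" property).
w_atoms = wasserstein_mixing([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], cost)

# Two genuine mixtures that differ only in how mass is split between topics 1 and 3.
w_mix = wasserstein_mixing([0.5, 0.5, 0.0], [0.0, 0.5, 0.5], cost)
```

In this toy example the first call recovers $d(A_1, A_2)$, since each mixing measure sits on a single atom. In the second, the cheapest plan leaves the shared mass on topic 2 in place and moves mass 0.5 from topic 1 to topic 3, so the cost is $0.5 \times d(A_1, A_3)$, which is small because those two topics are close in total variation.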