Likelihood estimation of sparse topic distributions in topic models and its applications to Wasserstein document distance calculations

This paper studies the estimation of high-dimensional, discrete, possibly sparse mixture models in the context of topic models. The data consist of observed multinomial counts of $p$ words across $n$ independent documents. In topic models, the $p\times n$ expected word frequency matrix is assumed to factorize as the product of a $p\times K$ word-topic matrix $A$ and a $K\times n$ topic-document matrix $T$. Since the columns of both matrices represent conditional probabilities belonging to probability simplices, the columns of $A$ are viewed as $p$-dimensional mixture components that are common to all documents, while the columns of $T$ are viewed as $K$-dimensional mixture weights that are document specific and are allowed to be sparse. The main interest is to provide sharp, finite-sample, $\ell_1$-norm convergence rates for estimators of the mixture weights $T$ when $A$ is either known or unknown. For known $A$, we suggest estimating $T$ by maximum likelihood (MLE). Our non-standard analysis of the MLE not only establishes its $\ell_1$ convergence rate, but also reveals a remarkable property: the MLE, with no extra regularization, can be exactly sparse and contain the true zero pattern of $T$. We further show that the MLE is both minimax optimal and adaptive to the unknown sparsity in a large class of sparse topic distributions. When $A$ is unknown, we estimate $T$ by optimizing the likelihood function corresponding to a generic plug-in estimator $\hat{A}$ of $A$. For any estimator $\hat{A}$ that satisfies carefully detailed conditions for proximity to $A$, the resulting estimator of $T$ is shown to retain the properties established for the MLE. The ambient dimensions $K$ and $p$ are allowed to grow with the sample sizes. Our application is to the estimation of 1-Wasserstein distances between document generating distributions. We propose, estimate and analyze new 1-Wasserstein distances between two probabilistic document representations.
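
To make the known-$A$ setting concrete, the following is a minimal sketch of computing the MLE of one document's topic weights (a single column of $T$) from its word counts. The abstract does not say which optimization algorithm is used, so the classical EM update for mixture weights with known components is an assumption made here for illustration; the function name `mle_topic_weights`, the toy matrix `A`, and the counts `x` are hypothetical and not taken from the paper.

```python
import numpy as np

def mle_topic_weights(A, x, n_iter=2000, tol=1e-12):
    """Approximate the MLE of mixture weights t for counts x ~ Multinomial(N, A @ t).

    A : (p, K) array whose columns are word distributions (each sums to 1).
    x : (p,) array of word counts for a single document.
    Assumes every observed word has positive probability under some topic.
    Uses the standard EM fixed-point update (an illustrative choice, not
    necessarily the procedure used in the paper).
    """
    p, K = A.shape
    N = x.sum()
    t = np.full(K, 1.0 / K)          # start at the centre of the simplex
    for _ in range(n_iter):
        mix = A @ t                  # fitted word probabilities, shape (p,)
        # x_j / (A t)_j, evaluated only where the word was observed
        ratio = np.divide(x, mix, out=np.zeros_like(mix), where=x > 0)
        t_new = t * (A.T @ ratio) / N    # EM update; stays on the simplex
        converged = np.abs(t_new - t).sum() < tol
        t = t_new
        if converged:
            break
    return t

# Toy example (hypothetical numbers): p = 6 words, K = 3 topics.
A = np.array([
    [0.40, 0.10, 0.10],
    [0.30, 0.10, 0.10],
    [0.10, 0.40, 0.10],
    [0.10, 0.30, 0.10],
    [0.05, 0.05, 0.30],
    [0.05, 0.05, 0.30],
])
t_true = np.array([0.7, 0.3, 0.0])   # sparse: the third topic is absent
# Noise-free counts, equal to 1000 * A @ t_true, so the MLE is t_true itself.
x = np.array([310, 240, 190, 160, 50, 50])

t_hat = mle_topic_weights(A, x)
print(np.round(t_hat, 3))
```

Because the update is multiplicative, iterates started from strictly positive weights remain strictly positive, so this sketch only approaches the zero entries of the exact MLE in the limit. The exact-sparsity property stated in the abstract concerns the MLE itself, not these intermediate iterates.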
