Word embeddings are a fundamental tool in natural language processing. Currently, word embedding methods are evaluated on the basis of empirical performance on benchmark data sets, and there is a lack of rigorous understanding of their theoretical properties. This paper studies word embeddings from a statistical-theoretical perspective, which is essential for formal inference and uncertainty quantification. We propose a copula-based statistical model for text data and show that, under this model, the now-classical Word2Vec method can be interpreted as a statistical estimation procedure for the theoretical pointwise mutual information (PMI). Next, building on the work of Levy and Goldberg (2014), we develop a missing-value-based estimator as a statistically tractable and interpretable alternative to the Word2Vec approach. The estimation error of this estimator is comparable to that of Word2Vec and improves upon the truncation-based method proposed by Levy and Goldberg (2014). The proposed estimator also performs comparably to Word2Vec in a benchmark sentiment analysis task on the IMDb Movie Reviews data set.
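As background for the PMI-based interpretation above (the notation here is introduced for illustration and is not taken verbatim from the paper), the pointwise mutual information between a word $w$ and a context word $c$ with joint occurrence probability $p(w,c)$ and marginals $p(w)$, $p(c)$ is
\[
\mathrm{PMI}(w, c) = \log \frac{p(w,c)}{p(w)\,p(c)},
\]
and Levy and Goldberg (2014) showed that skip-gram with negative sampling (the Word2Vec variant considered in that work) implicitly factorizes the PMI matrix shifted by $\log k$, where $k$ is the number of negative samples: at the optimum, the word and context vectors satisfy $\vec{w} \cdot \vec{c} \approx \mathrm{PMI}(w,c) - \log k$.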