Word Embeddings as Statistical Estimators

Word embeddings are a fundamental tool in natural language processing. Currently, word embedding methods are evaluated on the basis of empirical performance on benchmark data sets, and there is a lack of rigorous understanding of their theoretical properties. This paper studies word embeddings from a statistical theoretical perspective, which is essential for formal inference and uncertainty quantification. We propose a copula-based statistical model for text data and show that under this model, the now-classical Word2Vec method can be interpreted as a statistical estimation method for estimating the theoretical pointwise mutual information (PMI). Next, by building on the work of Levy and Goldberg (2014), we develop a missing value-based estimator as a statistically tractable and interpretable alternative to the Word2Vec approach. The estimation error of this estimator is comparable to Word2Vec and improves upon the truncation-based method proposed by Levy and Goldberg (2014). The proposed estimator also performs comparably to Word2Vec in a benchmark sentiment analysis task on the IMDb Movie Reviews data set.
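To make the PMI connection concrete, the following is a minimal sketch, not the paper's estimator. It computes an empirical PMI matrix from windowed co-occurrence counts, applies a positive-PMI truncation of the kind usually associated with Levy and Goldberg (2014), and factorizes the result with an SVD. The function names, the symmetric window, and the square-root scaling of the singular values are assumptions chosen for illustration. The missing value-based estimator proposed in the abstract is not specified there, so it is not reproduced here.

```python
import numpy as np

def cooccurrence_counts(tokens, vocab, window=2):
    """Count symmetric word-context co-occurrences within `window` positions."""
    index = {word: i for i, word in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)))
    for pos, word in enumerate(tokens):
        lo = max(0, pos - window)
        hi = min(len(tokens), pos + window + 1)
        for ctx_pos in range(lo, hi):
            if ctx_pos != pos:
                counts[index[word], index[tokens[ctx_pos]]] += 1
    return counts

def empirical_pmi(counts):
    """Plug-in PMI: log p(w, c) - log p(w) - log p(c); -inf for unseen pairs."""
    total = counts.sum()
    p_joint = counts / total
    p_word = counts.sum(axis=1, keepdims=True) / total
    p_ctx = counts.sum(axis=0, keepdims=True) / total
    with np.errstate(divide="ignore"):
        return np.log(p_joint) - np.log(p_word) - np.log(p_ctx)

def positive_pmi(pmi):
    """Truncate at zero, so negative and -inf entries become 0."""
    return np.maximum(pmi, 0.0)

def svd_embeddings(matrix, dim):
    """Rank-`dim` embeddings from the SVD, scaled by sqrt of singular values."""
    u, s, _ = np.linalg.svd(matrix)
    return u[:, :dim] * np.sqrt(s[:dim])

tokens = "the cat sat on the mat".split()
vocab = sorted(set(tokens))
counts = cooccurrence_counts(tokens, vocab, window=2)
ppmi = positive_pmi(empirical_pmi(counts))
embeddings = svd_embeddings(ppmi, dim=2)
```

The truncation step discards every unobserved pair by setting its entry to zero. According to the abstract, the paper instead treats those entries as missing values, which is where its estimator departs from this sketch.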

Related content

Word vector representations

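A distributed representation encodes language as dense, low-dimensional, continuous vectors. Researchers noticed early on that learned word embeddings exhibit analogy relations, for example apple − apples ≈ car − cars and man − woman ≈ king − queen. These embedding methods can be trained directly on large unlabeled corpora. The quality of the embeddings also depends heavily on the choice of context window size: a large window usually yields embeddings that reflect topical information, whereas a small window yields embeddings that reflect a word's function and its contextual semantics.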
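The analogy relation can be checked with simple vector arithmetic. The sketch below uses hand-made three-dimensional toy vectors, which are purely hypothetical and not learned from any corpus, and a hypothetical helper `solve_analogy` that picks the word whose vector is closest in cosine similarity to b − a + c.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def solve_analogy(a, b, c, embeddings):
    """Return the word d that best completes a : b :: c : d."""
    target = embeddings[b] - embeddings[a] + embeddings[c]
    # Exclude the query words themselves, as is common in analogy evaluation.
    candidates = [w for w in embeddings if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(embeddings[w], target))

# Toy vectors (assumed for illustration): axis 1 ~ gender, axis 2 ~ royalty.
toy = {
    "man": np.array([1.0, 0.0, 1.0]),
    "woman": np.array([1.0, 1.0, 1.0]),
    "king": np.array([1.0, 0.0, 3.0]),
    "queen": np.array([1.0, 1.0, 3.0]),
    "apple": np.array([-2.0, 0.5, 0.0]),
}

answer = solve_analogy("man", "woman", "king", toy)
```

With real embeddings the match is only approximate, and the result can change with the context window size used in training.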
