The unigram distribution is the non-contextual probability of finding a specific word form in a corpus. While of central importance to the study of language, it is commonly approximated by each word's sample frequency in the corpus. This approach, being highly dependent on sample size, assigns zero probability to any out-of-vocabulary (oov) word form. As a result, it produces negatively biased probabilities for oov word forms and positively biased probabilities for in-corpus words. In this work, we argue in favor of properly modeling the unigram distribution, claiming it should be a central task in natural language processing. With this in mind, we present a novel model for estimating it in a language (a neuralization of Goldwater et al.'s (2011) model) and show that, across a diverse set of 7 languages, it produces much better estimates than the na\"ive use of neural character-level language models.
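To make the bias concrete, here is a minimal Python sketch of the sample-frequency estimate on a toy corpus (illustrative only, not code from the paper): every oov word form receives zero probability, so the probability mass it should hold is redistributed to in-corpus words.

\begin{verbatim}
from collections import Counter

# Toy corpus: the naive unigram estimate is each word's
# relative frequency in the sample.
corpus = "the cat sat on the mat and the cat ran".split()
counts = Counter(corpus)
total = sum(counts.values())

def sample_frequency(word):
    # Maximum-likelihood (sample-frequency) unigram estimate.
    return counts[word] / total

print(sample_frequency("cat"))  # 0.2: positively biased by the tiny sample
print(sample_frequency("dog"))  # 0.0: any oov form gets zero probability
\end{verbatim}

A proper model of the unigram distribution must instead reserve probability mass for unseen word forms, e.g., by scoring them with a generator over character sequences.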