The uniform information density (UID) hypothesis, which posits that speakers prefer utterances that distribute information uniformly across the signal, has gained substantial traction in psycholinguistics as an explanation for certain syntactic, morphological, and prosodic choices. Could uniform information density also be operationalized as an inductive bias for statistical language modeling? In this paper, we augment the canonical MLE objective for training language models by encoding UID as a regularizer. In experiments on ten languages spanning five language families, we find that UID regularization consistently improves perplexity, with larger gains when training data is limited. Moreover, via analysis of generated sequences, we find that UID-regularized language models have higher entropy and produce text that is longer and more lexically diverse. Our results not only suggest that UID is a reasonable inductive bias for language modeling, but also provide an alternative validation of the UID hypothesis using modern-day NLP tools.
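The abstract does not specify the exact form of the regularizer, but one natural operationalization of UID is to penalize the variance of per-token surprisals: a sequence whose information is spread uniformly has low surprisal variance. The sketch below illustrates this idea under that assumption; `uid_regularized_loss` and the weight `lam` are hypothetical names, not the paper's API.

```python
import math

def uid_regularized_loss(token_probs, lam=0.1):
    """Sketch of a UID-style training objective.

    token_probs: model probabilities assigned to each observed token.
    The standard MLE/NLL term is the mean surprisal -log p_i; the UID
    term (one possible operationalization) is the variance of the
    surprisals, which is zero when information is perfectly uniform.
    """
    surprisals = [-math.log(p) for p in token_probs]
    n = len(surprisals)
    mean_s = sum(surprisals) / n          # MLE / negative log-likelihood term
    variance = sum((s - mean_s) ** 2 for s in surprisals) / n  # UID penalty
    return mean_s + lam * variance
```

With a perfectly uniform distribution of information (e.g., every token assigned probability 0.5), the variance term vanishes and the loss reduces to plain NLL; skewed surprisals incur an extra penalty scaled by `lam`.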