The uniform information density (UID) hypothesis, which posits that speakers behaving optimally tend to distribute information uniformly across a linguistic signal, has gained traction in psycholinguistics as an explanation for certain syntactic, morphological, and prosodic choices. In this work, we explore whether the UID hypothesis can be operationalized as an inductive bias for statistical language modeling. Specifically, we augment the canonical MLE objective for training language models with a regularizer that encodes UID. In experiments on ten languages spanning five language families, we find that UID regularization consistently improves perplexity in language models, with a larger effect when training data is limited. Moreover, via an analysis of generated sequences, we find that UID-regularized language models have other desirable properties, e.g., they generate text that is more lexically diverse. Our results not only suggest that UID is a reasonable inductive bias for language modeling, but also provide an alternative validation of the UID hypothesis using modern-day NLP tools.
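One natural way to encode UID as a regularizer, as described above, is to penalize how unevenly surprisal is spread across a sequence. The sketch below is a minimal, hypothetical illustration of that idea in plain Python: it adds the variance of per-token surprisals to the standard negative log-likelihood. The weight `beta` and the variance-based form of the penalty are assumptions for illustration, not the paper's exact formulation or hyperparameters.

```python
import math

def uid_regularized_loss(token_probs, beta=0.1):
    """Illustrative UID-regularized loss (assumed form, not the paper's exact one).

    token_probs: the model's probability for each token in a sequence.
    Returns mean surprisal (the MLE/NLL term) plus beta times the
    variance of per-token surprisals (the UID penalty).
    """
    # Surprisal of each token: u_t = -log p(x_t | context)
    surprisals = [-math.log(p) for p in token_probs]
    n = len(surprisals)

    # Standard MLE objective: mean surprisal == per-token NLL
    nll = sum(surprisals) / n

    # UID penalty: variance of surprisals; zero when information
    # is perfectly uniformly distributed across the sequence
    var = sum((u - nll) ** 2 for u in surprisals) / n

    return nll + beta * var
```

A perfectly uniform sequence (every token equally surprising) incurs no penalty, so the loss reduces to the NLL; any unevenness in surprisal increases the loss, nudging the model toward distributing information uniformly.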