It was observed that large language models exhibit a power-law decay of cross entropy with respect to the number of parameters and training tokens. When extrapolated literally, this decay implies that the entropy rate of natural language is zero. To understand this phenomenon, or artifact, better, we construct a simple stationary stochastic process and its memory-based predictor that exhibit a power-law decay of cross entropy together with a vanishing entropy rate. Our example builds on the previously discussed Santa Fe processes, which decompose a random text into a process of narration and time-independent knowledge. Previous discussions assumed that the narration is a memoryless source with Zipf's distribution. In this paper, we propose a model of narration that has a vanishing entropy rate and applies a randomly chosen deterministic sequence called a multiperiodic sequence. Under a suitable parameterization, multiperiodic sequences exhibit asymptotic relative frequencies given by Zipf's law. Remaining agnostic about the value of the entropy rate of natural language, we discuss the relevance of similar constructions for language modeling.
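The literal extrapolation mentioned at the beginning can be spelled out in one line. The pure power-law form below, with no additive constant, is an assumed parameterization of the fitted scaling law; the symbols H(n), c and beta are introduced here for illustration and are not taken from the abstract.

```latex
% Assumed pure power-law fit of cross entropy, where n counts parameters or training tokens:
H(n) \approx c\, n^{-\beta}, \qquad c, \beta > 0,
\qquad\text{so that}\qquad
\lim_{n \to \infty} H(n) = 0.
% Taken literally, the cross entropy, and hence the entropy rate it upper-bounds,
% would vanish in the limit of unbounded scale.
```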
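As a concrete reference point, the baseline construction recalled above, a Santa Fe process whose narration is a memoryless source with Zipf's distribution, can be sketched in a few lines of Python. The finite vocabulary, the exponent alpha, and all names in the snippet are illustrative assumptions; in particular, this is the memoryless baseline rather than the multiperiodic narration proposed in the paper.

```python
import numpy as np

def santa_fe_process(n_tokens, vocab=10_000, alpha=1.5, seed=0):
    """Sample a Santa Fe process with memoryless Zipf-distributed narration.

    Each emitted token is a pair (k, z_k): the topic index k is drawn i.i.d.
    from a truncated Zipf law P(k) proportional to k**(-alpha), and z_k is a
    fixed random bit of knowledge attached to topic k, constant over time.
    """
    rng = np.random.default_rng(seed)
    # Time-independent knowledge: one fair coin flip per topic index.
    knowledge = rng.integers(0, 2, size=vocab)
    # Memoryless narration with Zipf's distribution over topics 1..vocab.
    ranks = np.arange(1, vocab + 1)
    probs = ranks.astype(float) ** (-alpha)
    probs /= probs.sum()
    narration = rng.choice(ranks, size=n_tokens, p=probs)
    # The text is the sequence of pairs (topic, its knowledge bit).
    return [(int(k), int(knowledge[k - 1])) for k in narration]

if __name__ == "__main__":
    print(santa_fe_process(10))
```

Intuitively, a memory-based predictor can lower its cross entropy on such a process by memorizing the knowledge bits of topics it has already seen; this is the kind of mechanism the abstract alludes to when it speaks of a power-law decay of cross entropy.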