Identifying the Periodicity of Information in Natural Language

Recent theoretical work on information density in natural language raises the following question: to what degree does natural language exhibit periodic patterns in the information it encodes? We address this question by introducing a new method called AutoPeriod of Surprisal (APS). APS adopts a canonical periodicity detection algorithm and can identify any significant periods in the surprisal sequence of a single document. Applying the algorithm to a set of corpora yields two main results. First, a considerable proportion of human language shows strong periodicity in information. Second, we find new periods that fall outside the distributions of typical structural units in text (e.g., sentence boundaries and elementary discourse units) and confirm them via harmonic regression modeling. We conclude that the periodicity of information in language is a joint outcome of structural factors and other driving factors that take effect at longer distances. We further discuss the advantages of our periodicity detection method and its potential for detecting LLM-generated text.
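The abstract does not spell out the detection procedure, so the following is only a minimal sketch of what an AutoPeriod-style pipeline on a surprisal sequence might look like. It assumes the "canonical" algorithm is the usual periodogram-plus-autocorrelation scheme: candidate frequencies are taken from periodogram bins whose power exceeds a permutation-based threshold, and each candidate is kept only if it sits on a local peak of the autocorrelation function. The function names `autoperiod_sketch` and `harmonic_r2`, the 99th-percentile threshold, the number of permutations, and the synthetic "surprisal" series are all illustrative assumptions, not the authors' implementation. In practice the input would be per-token surprisal values from a language model.

```python
import numpy as np


def autoperiod_sketch(x, n_perm=100, seed=0):
    """Return candidate periods (in tokens) for a 1-D sequence.

    Assumed two-step scheme: periodogram candidates, then validation
    against local maxima of the autocorrelation function (ACF).
    """
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = len(x)
    rng = np.random.default_rng(seed)

    # Step 1: periodogram and a power threshold estimated from shuffled copies.
    power = np.abs(np.fft.rfft(x)) ** 2
    max_shuffled = [
        np.max(np.abs(np.fft.rfft(rng.permutation(x))) ** 2)
        for _ in range(n_perm)
    ]
    threshold = np.percentile(max_shuffled, 99)

    # Normalized (biased) autocorrelation for lags 0..n-1.
    acf = np.correlate(x, x, mode="full")[n - 1:] / np.dot(x, x)

    periods = []
    for k in range(2, len(power)):
        if power[k] <= threshold:
            continue
        # Step 2: search the ACF between the periods of the neighboring bins.
        lo = max(int(np.floor(n / (k + 1))), 2)
        hi = min(int(np.ceil(n / (k - 1))), n - 2)
        if lo > hi:
            continue
        lag = lo + int(np.argmax(acf[lo:hi + 1]))
        is_peak = acf[lag] > acf[lag - 1] and acf[lag] > acf[lag + 1]
        if is_peak and lag not in periods:
            periods.append(lag)
    return periods


def harmonic_r2(x, period):
    """R^2 of a one-harmonic regression: x ~ a + b*cos + c*sin at `period`."""
    x = np.asarray(x, dtype=float)
    t = np.arange(len(x))
    design = np.column_stack([
        np.ones(len(x)),
        np.cos(2 * np.pi * t / period),
        np.sin(2 * np.pi * t / period),
    ])
    beta, *_ = np.linalg.lstsq(design, x, rcond=None)
    resid = x - design @ beta
    return 1.0 - (resid @ resid) / np.sum((x - x.mean()) ** 2)


# Synthetic stand-in for a surprisal sequence with a 20-token cycle (assumption).
rng = np.random.default_rng(1)
t = np.arange(600)
surprisal = 5.0 + np.sin(2 * np.pi * t / 20) + rng.normal(0, 0.3, size=600)

periods = autoperiod_sketch(surprisal)
fit_quality = harmonic_r2(surprisal, 20)
```

On this synthetic input the detected period should fall at or next to 20 tokens, and the harmonic fit should explain most of the variance. Real surprisal sequences are far noisier, so the threshold and the peak test would likely need tuning.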
