The optimality of word lengths. Theoretical foundations and an empirical study

One of the most robust patterns found in human languages is Zipf's law of abbreviation, that is, the tendency of more frequent words to be shorter. Since Zipf's pioneering research, this law has been viewed as a manifestation of compression, i.e. the minimization of the length of forms, a universal principle of natural communication. Although the claim that languages are optimized has become trendy, attempts to measure the degree of optimization of languages have been rather scarce. Here we demonstrate that compression manifests itself in a wide sample of languages without exceptions, and independently of the unit of measurement. It is detectable both for word lengths in characters in written language and for word durations in time in spoken language. Moreover, to measure the degree of optimization, we derive a simple formula for a random baseline and present two scores that are dually normalized, namely, they are normalized with respect to both the minimum and the random baseline. We analyze the theoretical and statistical advantages and disadvantages of these and other scores. Harnessing the best score, we quantify for the first time the degree of optimality of word lengths in languages. This indicates that languages are optimized to 62 or 67 percent on average (depending on the source) when word lengths are measured in characters, and to 65 percent on average when word lengths are measured in time. In general, spoken word durations are more optimized than written word lengths in characters. Beyond the analyses reported here, our work paves the way to measure the degree of optimality of the vocalizations or gestures of other species, and to compare them against written, spoken, or signed human languages.
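
To make the idea of a dually normalized score concrete, the following is a minimal sketch, not the formula as stated in the paper (the abstract does not give it). It assumes that the mean word length is the frequency-weighted average L, that the minimum L_min is reached by pairing the most frequent words with the shortest available lengths, and that the random baseline L_rand is the expected L when lengths are shuffled uniformly at random across words, which reduces to the plain average of the lengths. The function names `mean_length` and `optimality_score` are hypothetical, and the score (L_rand - L) / (L_rand - L_min) is one natural choice that equals 1 at the minimum and 0 at the random baseline.

```python
from statistics import fmean


def mean_length(probs, lengths):
    """Frequency-weighted mean word length: L = sum_i p_i * l_i."""
    return sum(p * l for p, l in zip(probs, lengths))


def optimality_score(freqs, lengths):
    """Illustrative dually normalized score (an assumption, see lead-in).

    Returns (L_rand - L) / (L_rand - L_min): 1 when word lengths are
    optimally assigned, about 0 when they are no better than a random
    assignment, and negative when they are worse than random.
    """
    if len(freqs) != len(lengths) or not freqs:
        raise ValueError("freqs and lengths must be non-empty and aligned")

    total = sum(freqs)
    probs = [f / total for f in freqs]

    # Observed mean length.
    L = mean_length(probs, lengths)

    # Minimum: most probable words receive the shortest lengths.
    L_min = mean_length(sorted(probs, reverse=True), sorted(lengths))

    # Random baseline: under a uniformly random permutation of lengths,
    # E[L] = sum_i p_i * mean(lengths) = mean(lengths).
    L_rand = fmean(lengths)

    if L_rand == L_min:
        raise ValueError("score undefined: baseline equals the minimum")
    return (L_rand - L) / (L_rand - L_min)


# Toy data (invented for illustration): four word types.
freqs = [50, 30, 15, 5]
print(optimality_score(freqs, [1, 2, 3, 4]))  # shortest lengths on most frequent words
print(optimality_score(freqs, [4, 3, 2, 1]))  # the reverse assignment
```

On the toy data, the first call describes a vocabulary whose lengths are already optimally assigned, and the second describes the worst possible assignment of the same lengths. Real inputs would be word frequencies from a corpus paired with lengths in characters or durations in time.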
