The optimality of word lengths. Theoretical foundations and an empirical study

One of the most robust patterns found in human languages is Zipf's law of abbreviation, that is, the tendency of more frequent words to be shorter. Since Zipf's pioneering research, this law has been viewed as a manifestation of compression, i.e., the minimization of the length of forms, a universal principle of natural communication. Although the claim that languages are optimized has become trendy, attempts to measure the degree of optimization of languages have been rather scarce. Here we demonstrate that compression manifests itself in a wide sample of languages without exception, and independently of the unit of measurement. It is detectable both for word lengths in characters in written language and for durations in time in spoken language. Moreover, to measure the degree of optimization, we derive a simple formula for a random baseline and present two scores that are dually normalized, namely, they are normalized with respect to both the minimum and the random baseline. We analyze the theoretical and statistical pros and cons of these and other scores. Harnessing the best score, we quantify for the first time the degree of optimality of word lengths in languages. The results indicate that languages are optimized to 62 or 67 percent on average (depending on the source) when word lengths are measured in characters, and to 65 percent on average when word lengths are measured in time. In general, spoken word durations are more optimized than written word lengths in characters. Beyond the analyses reported here, our work paves the way to measuring the degree of optimality of the vocalizations or gestures of other species, and to comparing them against written, spoken, or signed human languages.
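
To make the idea of a dually normalized score concrete, the following is a minimal sketch and not the paper's definition, since the abstract does not state the formulas. It assumes that the quantity being optimized is the frequency-weighted mean word length L, that the minimum baseline L_min is obtained by pairing the most frequent words with the shortest lengths, that the random baseline L_r is the expected value of L when lengths are shuffled uniformly at random across words (which equals the unweighted mean length), and that the score is (L_r - L) / (L_r - L_min). All function names are hypothetical.

```python
def mean_length(freqs, lengths):
    """Frequency-weighted mean word length L (assumed cost function)."""
    total = sum(freqs)
    return sum(f * l for f, l in zip(freqs, lengths)) / total


def min_baseline(freqs, lengths):
    """L_min: pair the most frequent words with the shortest lengths."""
    return mean_length(sorted(freqs, reverse=True), sorted(lengths))


def random_baseline(lengths):
    """L_r: expected L under a uniformly random assignment of lengths to words.

    Each word receives each length with equal probability, so the
    expectation is the unweighted mean of the lengths.
    """
    return sum(lengths) / len(lengths)


def dual_score(freqs, lengths):
    """Assumed dually normalized score: 1 at the minimum, 0 at the random baseline."""
    L = mean_length(freqs, lengths)
    L_min = min_baseline(freqs, lengths)
    L_r = random_baseline(lengths)
    if L_r == L_min:
        raise ValueError("score undefined: all lengths or all frequencies are equal")
    return (L_r - L) / (L_r - L_min)


if __name__ == "__main__":
    # Toy data (invented for illustration): three words with their
    # frequencies and their lengths in characters.
    freqs = [5, 3, 2]
    lengths = [1, 2, 3]
    print(dual_score(freqs, lengths))
```

Under these assumptions, a vocabulary whose shortest forms are assigned to its most frequent words scores 1, a vocabulary whose mean length matches the random baseline scores 0, and negative values indicate lengths that are worse than random.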
