The optimality of word lengths. Theoretical foundations and an empirical study

From arXiv (author comments): On the one hand, the article has been shortened: analyses of the law of abbreviation and some of the methods have been moved to another article, and appendix B has been reduced. On the other hand, various parts have been rewritten for clarity, new figures have been added to make the scores easier to understand, and new citations have been added. Many typos have been corrected.

Zipf's law of abbreviation, namely the tendency of more frequent words to be shorter, has been viewed as a manifestation of compression, i.e. the minimization of the length of forms, a universal principle of natural communication. Although the claim that languages are optimized has become trendy, attempts to measure the degree of optimization of languages have been rather scarce. Here we present two optimality scores that are dually normalized, namely, they are normalized with respect to both the minimum and the random baseline. We analyze the theoretical and statistical pros and cons of these and other scores. Harnessing the best score, we quantify for the first time the degree of optimality of word lengths in languages. This indicates that languages are optimized to 62 or 67 percent on average (depending on the source) when word lengths are measured in characters, and to 65 percent on average when word lengths are measured in time. In general, spoken word durations are more optimized than written word lengths in characters. Our work paves the way for measuring the degree of optimality of the vocalizations or gestures of other species, and for comparing them against written, spoken, or signed human languages.
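To make "normalized with respect to both the minimum and the random baseline" concrete, here is a minimal sketch of one score with that property. The abstract does not give a formula, so the form used here, (L_rand - L) / (L_rand - L_min), is an assumption, and the function names `mean_length` and `dually_normalized_score` are hypothetical. L is the frequency-weighted mean word length, L_min is the smallest mean obtainable by reassigning the same lengths to the same frequencies, and L_rand is the expected mean when lengths are shuffled at random.

```python
from typing import Sequence


def mean_length(freqs: Sequence[float], lengths: Sequence[float]) -> float:
    """Frequency-weighted mean word length, L = sum(p_i * l_i)."""
    total = sum(freqs)
    return sum(f * l for f, l in zip(freqs, lengths)) / total


def dually_normalized_score(freqs: Sequence[float], lengths: Sequence[float]) -> float:
    """Assumed score (L_rand - L) / (L_rand - L_min); not taken from the abstract."""
    if len(freqs) != len(lengths) or len(freqs) == 0:
        raise ValueError("freqs and lengths must be non-empty and equally long")

    # Observed mean length with the actual frequency/length pairing.
    observed = mean_length(freqs, lengths)

    # Minimum baseline: give the shortest lengths to the most frequent words.
    minimum = mean_length(sorted(freqs, reverse=True), sorted(lengths))

    # Random baseline: the expected mean under a uniformly random shuffle
    # of lengths is the unweighted mean of the lengths.
    random_baseline = sum(lengths) / len(lengths)

    if random_baseline == minimum:
        raise ValueError("score is undefined when the two baselines coincide")

    return (random_baseline - observed) / (random_baseline - minimum)
```

Under this assumed form, a value of 1 means the observed pairing attains the minimum baseline, 0 means it is no better than a random pairing, and negative values mean it is worse than random. For example, `dually_normalized_score([50, 30, 20], [1, 2, 3])` pairs the shortest length with the most frequent word and so sits at the minimum.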

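Summary: Zipf's law of abbreviation, formulated by the American linguist George Zipf, states that more frequently used words tend to be shorter. It has been viewed as a manifestation of compression, a universal principle of natural communication. Although languages are often claimed to be optimized, few studies have tried to measure how optimized they are. The paper presents two optimality scores that are normalized with respect to two baselines, the minimum baseline and the random baseline, and analyzes the theoretical and statistical pros and cons of these and other scores. Using the best score, it quantifies for the first time the degree of optimality of word lengths in languages: on average, languages are optimized to 62 or 67 percent (depending on the source) when word lengths are measured in characters, and to 65 percent when they are measured in time. In general, spoken word durations are more optimized than written word lengths in characters. The work paves the way for measuring the degree of optimality of the vocalizations or gestures of other species and for comparing them with written, spoken, or signed human languages.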