Zipf's law of abbreviation, the tendency of more frequent words to be shorter, is one of the most solid candidates for a linguistic universal, in the sense that it may be exceptionless or have a number of exceptions that is vanishingly small compared to the number of languages on Earth. Since Zipf's pioneering research, this law has been viewed as a manifestation of a universal principle of communication, namely the minimization of word lengths to reduce the effort of communication. Here we revisit the concordance of written language with the law of abbreviation. Crucially, we provide wider evidence that the law also holds in speech (when word length is measured as duration in time), in particular in 46 languages from 14 linguistic families. Agreement with the law of abbreviation provides indirect evidence of compression in languages, via the theoretical argument that the law of abbreviation is a prediction of optimal coding. Motivated by the need for direct evidence of compression, we derive a simple formula for a random baseline and find that word lengths are systematically below chance, across linguistic families and writing systems, and independently of the unit of measurement (length in characters or duration in time). Our work paves the way for measuring and comparing the degree of optimality of word lengths across languages.
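The comparison against a random baseline can be illustrated with a minimal sketch. It assumes that "chance" means assigning the observed lengths to word types uniformly at random, in which case the expected frequency-weighted mean length equals the unweighted mean length over types. This is one natural baseline and may differ in detail from the formula derived in the paper. The toy corpus and the function names (`mean_length`, `random_baseline`, `shuffled_mean`) are hypothetical.

```python
import random
from collections import Counter


def mean_length(freqs, lengths):
    """Frequency-weighted mean word length, L = sum_i p_i * l_i."""
    total = sum(freqs)
    return sum(f * l for f, l in zip(freqs, lengths)) / total


def random_baseline(lengths):
    """Expected L when lengths are shuffled uniformly at random across types.

    Under such a shuffle every type receives, on average, the unweighted mean
    length, so the expectation is the mean length over types (assumed baseline).
    """
    return sum(lengths) / len(lengths)


def shuffled_mean(freqs, lengths, trials=20000, seed=0):
    """Monte Carlo estimate of the same baseline, as a sanity check."""
    rng = random.Random(seed)
    pool = list(lengths)
    acc = 0.0
    for _ in range(trials):
        rng.shuffle(pool)  # random reassignment of lengths to word types
        acc += mean_length(freqs, pool)
    return acc / trials


# Hypothetical toy corpus; length is measured in characters here,
# but durations in seconds could be used in the same way.
tokens = "the cat sat on the mat and the dog sat on the extraordinary rug".split()
counts = Counter(tokens)
words = list(counts)
freqs = [counts[w] for w in words]
lengths = [len(w) for w in words]

L = mean_length(freqs, lengths)
L_r = random_baseline(lengths)
print(f"L = {L:.3f}, baseline = {L_r:.3f}, below chance: {L < L_r}")
```

On this toy corpus the weighted mean length is about 3.57 characters against a baseline of 4.0, so the lengths fall below chance. With real data, `lengths` would hold per-type character counts or mean spoken durations.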