March 17, 2023

Direct and indirect evidence of compression of word lengths. Zipf's law of abbreviation revisited


Sonia Petrini, Antoni Casas-i-Muñoz, Jordi Cluet-i-Martinell, Mengxue Wang, Chris Bentz, Ramon Ferrer-i-Cancho

From arXiv. arXiv admin note: substantial text overlap with arXiv:2208.10384

Zipf's law of abbreviation, the tendency of more frequent words to be shorter, is one of the most solid candidates for a linguistic universal, in the sense that it has the potential for being exceptionless or having a number of exceptions that is vanishingly small compared to the number of languages on Earth. Since Zipf's pioneering research, this law has been viewed as a manifestation of a universal principle of communication, i.e. the minimization of word lengths to reduce the effort of communication. Here we revisit the concordance of written language with the law of abbreviation. Crucially, we provide wider evidence that the law also holds in speech (when word length is measured in time), in particular in 46 languages from 14 linguistic families. Agreement with the law of abbreviation provides indirect evidence of compression of languages via the theoretical argument that the law of abbreviation is a prediction of optimal coding. Motivated by the need for direct evidence of compression, we derive a simple formula for a random baseline indicating that word lengths are systematically below chance, across linguistic families and writing systems, and independently of the unit of measurement (length in characters or duration in time). Our work paves the way for measuring and comparing the degree of optimality of word lengths in languages.
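
The abstract does not state the baseline formula, so the following is only an illustrative sketch under stated assumptions. It assumes the quantity of interest is the frequency-weighted mean word length, L = Σ p_i · l_i, and that the random baseline is the expected value of L when lengths are reassigned to word types by a uniformly random permutation, which works out to the unweighted mean of the type lengths. The function names, the toy corpus, and the use of character counts as the length measure are all hypothetical choices made for illustration.

```python
from collections import Counter


def mean_length(tokens, length=len):
    """Frequency-weighted mean word length: L = sum_i p_i * l_i."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return sum((c / total) * length(word) for word, c in counts.items())


def random_baseline(tokens, length=len):
    """Expected L if lengths were shuffled uniformly at random across types.

    Under a random permutation every type receives, on average, the mean
    type length, so the expectation is the unweighted mean over types.
    """
    types = set(tokens)
    return sum(length(word) for word in types) / len(types)


# Toy corpus (hypothetical); length is measured in characters here.
tokens = "the cat saw the dog and the elephant saw the cat".split()

observed = mean_length(tokens)
baseline = random_baseline(tokens)
print(f"observed L = {observed:.3f}, random baseline = {baseline:.3f}")
print("below chance" if observed < baseline else "not below chance")
```

In this toy corpus the only long word is also rare, so the observed mean falls below the baseline, which is the pattern the abstract describes as direct evidence of compression. Passing a different `length` callable (for example, a lookup of spoken durations) would apply the same comparison to length measured in time.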

