A Statistical Model of Word Rank Evolution

From arXiv. Comments: this manuscript, with 30 pages (main), 10 figures (main), 22 pages (supplementary), and 17 figures (supplementary), is a manuscript for a journal research article.

The availability of large linguistic data sets enables data-driven approaches to studying linguistic change. We use the Google Books corpus unigram frequency data set to investigate word rank dynamics in eight languages. We observed the rank changes of the unigrams from 1900 to 2008 and compared them to a Wright-Fisher inspired model that we developed for our analysis. The model simulates a neutral evolutionary process with the restriction that no words disappear or are added. This work explains the mathematical framework of the model, written as a Markov chain with multinomial transition probabilities, to show how the frequencies of words change in time. From our observations in the data and our model, word rank stability shows two types of characteristics: (1) the increase or decrease in rank is monotonic, or (2) the rank stays the same. Based on our model, high-ranked words tend to be more stable while low-ranked words tend to be more volatile. Some words change in rank in two ways: (a) by an accumulation of small increasing or decreasing rank changes over time and (b) by shocks of increase or decrease in rank. Most words in all of the languages we have looked at are rank stable, but not as stable as a neutral model would predict. The stopwords and Swadesh words are observed to be rank stable across the eight languages, indicating linguistic conformity in established languages. These signatures suggest that unigram frequencies in all languages have changed in a manner inconsistent with a purely neutral evolutionary process.
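The neutral model described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `simulate_neutral_counts` and `ranks_from_counts` are hypothetical, and the way the "no disappearing and no added words" restriction is enforced here (reserving one token per word and drawing the remaining tokens multinomially) is an assumption made for illustration. The vocabulary and the total token count stay fixed, and each step depends only on the previous one, so the sequence of count vectors forms a Markov chain with multinomial transitions.

```python
import numpy as np


def simulate_neutral_counts(initial_counts, n_steps, seed=0):
    """Simulate word counts under a Wright-Fisher style neutral process.

    Assumption (not taken from the paper): every word keeps at least one
    token per step, so no word disappears and no word is added.
    """
    rng = np.random.default_rng(seed)
    counts = np.asarray(initial_counts, dtype=np.int64)
    if counts.ndim != 1 or np.any(counts < 1):
        raise ValueError("initial_counts must be a 1-D array of counts >= 1")

    n_words = counts.size
    total = int(counts.sum())  # corpus size, constant over time

    history = np.empty((n_steps + 1, n_words), dtype=np.int64)
    history[0] = counts
    for t in range(1, n_steps + 1):
        # Neutral step: sampling probabilities are the previous frequencies.
        probs = history[t - 1] / total
        # Reserve one token per word, then draw the rest multinomially.
        draws = rng.multinomial(total - n_words, probs)
        history[t] = draws + 1
    return history


def ranks_from_counts(history):
    """Convert counts to ranks per time step (rank 1 = most frequent).

    Ties are broken by word index through a stable sort.
    """
    order = np.argsort(-history, axis=1, kind="stable")
    ranks = np.empty_like(order)
    rows = np.arange(history.shape[0])[:, None]
    ranks[rows, order] = np.arange(1, history.shape[1] + 1)
    return ranks
```

As a usage example, one could start from a Zipf-like count vector such as `np.maximum(1, (10000 / np.arange(1, 51)).astype(int))`, run `simulate_neutral_counts` for 108 steps to mirror the 1900 to 2008 span, and compare how far the rank trajectories of high-count and low-count words move from their starting ranks.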
