Statistical linguistics has advanced considerably in recent decades as more data has become available. This has allowed researchers to study how the statistical properties of languages change over time. In this work, we use data from Twitter to explore English and Spanish, considering rank diversity at different scales: temporal (from 3 to 96 hour intervals), spatial (from 3 km to 3000+ km radii), and grammatical (from monograms to pentagrams). We find that all three scales are relevant. However, the greatest changes come from variations in the grammatical scale. At the lowest grammatical scale (monograms), the rank diversity curves are most similar, independently of the values of the other scales, languages, and countries. As the grammatical scale grows, the rank diversity curves vary more depending on the temporal and spatial scales, as well as on the language and country. We also study the statistics of Twitter-specific tokens: emojis, hashtags, and user mentions. These types of tokens show sigmoid-like behaviour in their rank diversity curves. Our results help to quantify which aspects of language statistics seem universal and which may lead to variations.
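To make the central quantity concrete, the following is a minimal sketch of how rank diversity could be computed, not the authors' implementation. The abstract does not define the measure, so the sketch assumes a commonly used definition: for each rank k, d(k) is the number of distinct n-grams that occupy rank k across the time slices, divided by the number of slices. The function names (`ngrams`, `ranked`, `rank_diversity`), the whitespace tokenisation, and the alphabetical tie-breaking are illustrative assumptions.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all n-grams (as tuples) of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ranked(tokens, n):
    """Rank the n-grams of one time slice by descending frequency."""
    counts = Counter(ngrams(tokens, n))
    # Assumption: ties are broken alphabetically so the ranking is deterministic.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ordered]


def rank_diversity(slices, n, max_rank):
    """Assumed definition: d(k) = distinct n-grams seen at rank k / number of slices."""
    rankings = [ranked(tokens, n) for tokens in slices]
    diversity = []
    for k in range(max_rank):
        # Collect whichever n-gram sits at rank k in each slice that is long enough.
        seen = {r[k] for r in rankings if k < len(r)}
        diversity.append(len(seen) / len(rankings))
    return diversity


if __name__ == "__main__":
    # Toy data: each list stands in for the tokens collected in one time interval.
    toy_slices = [
        "the cat sat on the mat the end".split(),
        "the dog sat on the log the end".split(),
        "a bird and a fish and a frog".split(),
    ]
    print(rank_diversity(toy_slices, n=1, max_rank=3))
```

In this sketch the temporal scale corresponds to how the tokens are grouped into `slices`, and the grammatical scale corresponds to `n` (1 for monograms up to 5 for pentagrams). A spatial scale would enter earlier, when choosing which tweets contribute to each slice.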