Computational measures of linguistic diversity help us understand the linguistic landscape using digital language data. The contribution of this paper is to calibrate measures of linguistic diversity using restrictions on international travel resulting from the COVID-19 pandemic. Previous work has mapped the distribution of languages using geo-referenced social media and web data. The goal of that work, however, has been to describe the corpora themselves rather than to make inferences about the underlying populations. This paper shows that a difference-in-differences method based on the Herfindahl-Hirschman Index can identify the bias that non-local populations introduce into digital corpora. These methods tell us where significant changes have taken place and whether those changes increase or decrease diversity. This is an important step in aligning digital corpora like social media with the real-world populations that have produced them.
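
To make the core measure concrete, the following is a minimal sketch of the Herfindahl-Hirschman Index (the sum of squared language shares) combined with a textbook two-group, two-period difference-in-differences. It assumes that design for illustration only; the paper's exact formulation may differ. The function names, the choice of a reference region, and all counts are hypothetical and are not taken from the paper's data.

```python
def hhi(language_counts):
    """Herfindahl-Hirschman Index: the sum of squared language shares.

    Ranges from 1/n (n languages used equally) up to 1.0 (one language only),
    so a HIGHER value means LOWER diversity.
    """
    total = sum(language_counts.values())
    if total == 0:
        raise ValueError("language_counts contains no observations")
    return sum((count / total) ** 2 for count in language_counts.values())


def diff_in_diff(treated_pre, treated_post, control_pre, control_post):
    """Standard difference-in-differences estimate.

    The change in the treated group minus the change in the control group.
    """
    return (treated_post - treated_pre) - (control_post - control_pre)


# Hypothetical document counts per language (illustrative values only).
# "Destination" stands for a place with many visitors before travel
# restrictions; "reference" stands for a place assumed to be unaffected.
dest_before = {"en": 50, "fr": 30, "de": 20}
dest_after = {"en": 80, "fr": 15, "de": 5}
ref_before = {"en": 70, "es": 30}
ref_after = {"en": 70, "es": 30}

effect = diff_in_diff(
    hhi(dest_before), hhi(dest_after), hhi(ref_before), hhi(ref_after)
)
print(f"Difference-in-differences in HHI: {effect:.3f}")
```

Under these assumptions, a positive estimate means that language use in the destination became more concentrated (less diverse) relative to the reference region, and a negative estimate means it became more diverse.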