The explosion in the availability of natural language data in the era of social media has given rise to a host of applications such as sentiment analysis and opinion mining. Simultaneously, the growing availability of precise geolocation information is enabling visualization of global phenomena such as environmental changes and disease propagation. Opportunities for tracking spatial variations in language use, however, have largely been overlooked, especially on small spatial scales. Here we explore the use of Twitter data with precise geolocation information to resolve spatial variations in language use on an urban scale, down to single city blocks. We identify several categories of language tokens likely to show distinctive patterns of use and develop quantitative methods to visualize the spatial distributions associated with these patterns. Our analysis concentrates on comparing contrasting pairs of Tweet distributions from the same category, each distribution defined by a set of tokens. Our work shows that analysis of small-scale variations can provide unique information on correlations between language use and social context, information that is highly valuable to a wide range of fields, from linguistic science and commercial advertising to social services.
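To make the idea of comparing a contrasting pair of token-defined Tweet distributions concrete, the following is a minimal sketch and not the method used in this work. It assumes each Tweet is available as a `(latitude, longitude, text)` tuple, bins Tweets into a regular latitude/longitude grid, and scores each cell by a smoothed log-ratio of the two counts. The function names, the grid size `cell_deg`, the whitespace tokenization, and the `smoothing` parameter are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def grid_cell(lat, lon, cell_deg=0.001):
    """Map a coordinate to an integer grid cell (0.001 degrees is roughly 100 m in latitude)."""
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))


def token_set_counts(tweets, tokens, cell_deg=0.001):
    """Count, per grid cell, the Tweets containing at least one token from the set."""
    tokens = {t.lower() for t in tokens}
    counts = Counter()
    for lat, lon, text in tweets:
        # Naive whitespace tokenization; real Tweets would need more careful handling.
        words = set(text.lower().split())
        if words & tokens:
            counts[grid_cell(lat, lon, cell_deg)] += 1
    return counts


def contrast_map(tweets, tokens_a, tokens_b, cell_deg=0.001, smoothing=1.0):
    """Return a smoothed log-ratio of set-A to set-B Tweet counts for every occupied cell.

    Positive values mark cells where set A dominates, negative values where set B dominates.
    """
    a = token_set_counts(tweets, tokens_a, cell_deg)
    b = token_set_counts(tweets, tokens_b, cell_deg)
    return {
        cell: math.log((a[cell] + smoothing) / (b[cell] + smoothing))
        for cell in set(a) | set(b)
    }


if __name__ == "__main__":
    # Made-up sample points, purely for illustration.
    sample = [
        (40.7501, -73.9901, "good morning"),
        (40.7501, -73.9901, "morning all"),
        (40.7601, -73.9801, "good night"),
    ]
    for cell, score in sorted(contrast_map(sample, {"morning"}, {"night"}).items()):
        print(cell, round(score, 3))
```

The additive smoothing keeps the ratio finite in cells where only one of the two token sets appears; the resulting per-cell scores could then be passed to any mapping tool for visualization.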