Storywrangler: A massive exploratorium for sociolinguistic, cultural, socioeconomic, and political timelines using Twitter

In real time, social media data strongly imprints world events, popular culture, and day-to-day conversations by millions of ordinary people at a scale that is scarcely conventionalized and recorded. Vitally, and unlike many standard corpora such as books and news archives, social media platforms have native sharing and commenting mechanisms, which enable us to quantify the social amplification (i.e., popularity) of trending storylines and contemporary cultural phenomena. Here, we describe Storywrangler, a natural language processing instrument designed to carry out an ongoing, day-scale curation of over 100 billion tweets containing roughly 1 trillion 1-grams from 2008 to 2021. For each day, we break tweets into unigrams, bigrams, and trigrams spanning over 100 languages. We track n-gram usage frequencies and generate Zipf distributions for words, hashtags, handles, numerals, symbols, and emojis. We make the data set available through an interactive time series viewer, and as downloadable time series and daily distributions. Although Storywrangler leverages Twitter data, our method of extracting and tracking dynamic changes of n-grams can be extended to any similar social media platform. We showcase a few examples of the many possible avenues of study we aim to enable, including how social amplification can be visualized through 'contagiograms'. We also present some example case studies that bridge n-gram time series with disparate data sources to explore sociotechnical dynamics of famous individuals, box office success, and social unrest.
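The per-day n-gram counting and rank-frequency (Zipf) tabulation described above can be illustrated with a minimal sketch. This is not Storywrangler's actual pipeline: the whitespace tokenizer, the lowercasing step, the function names (`ngrams`, `daily_ngram_counts`, `zipf_table`), and the alphabetical tie-breaking of ranks are all assumptions made for illustration. A real tokenizer would need to handle emojis, hashtags, handles, and languages that do not separate words with spaces.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def daily_ngram_counts(tweets, n):
    """Count n-grams over all tweets collected for a single day.

    Assumption: tokens are obtained by lowercasing and splitting on
    whitespace, which is far simpler than a production tokenizer.
    """
    counts = Counter()
    for text in tweets:
        tokens = text.lower().split()
        counts.update(ngrams(tokens, n))
    return counts


def zipf_table(counts):
    """Return (rank, ngram, count, relative frequency) rows, most frequent first.

    Assumption: ties in count are broken alphabetically so that the
    ordering is deterministic; ranks are simple ordinals starting at 1.
    """
    total = sum(counts.values())
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [
        (rank, gram, count, count / total)
        for rank, (gram, count) in enumerate(ranked, start=1)
    ]


if __name__ == "__main__":
    # Hypothetical sample standing in for one day of tweets.
    sample_day = [
        "good morning world",
        "good morning twitter",
        "hello world",
    ]
    for n in (1, 2, 3):
        print(f"--- {n}-grams ---")
        for row in zipf_table(daily_ngram_counts(sample_day, n)):
            print(row)
```

Running the same counting step once per day and storing each n-gram's rank and relative frequency yields the kind of daily time series the abstract refers to.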
