Twitter data have become essential to Natural Language Processing (NLP) and social science research, driving various scientific discoveries in recent years. However, textual data alone are often not enough to conduct studies: social scientists in particular need additional variables to perform their analyses and control for various factors. How we augment the data with such information, for example users' location or age, or tweet sentiment, has ramifications for anonymity and reproducibility, and requires dedicated effort. This paper describes Twitter-Demographer, a simple, flow-based tool to enrich Twitter data with additional information about tweets and users. Twitter-Demographer is aimed at NLP practitioners and (computational) social scientists who want to enrich their datasets with aggregated information; it facilitates reproducibility and provides algorithmic privacy-by-design measures for pseudo-anonymity. We discuss our design choices, inspired by the flow-based programming paradigm, to use black-box components that can easily be chained together and extended. We also analyze the ethical issues related to the use of this tool, and the built-in measures to facilitate pseudo-anonymity.
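To make the flow-based idea concrete, the following is a minimal sketch of black-box components chained into a pipeline, with a hashing step standing in for a pseudo-anonymity measure. It is an illustration only and not the Twitter-Demographer API: the names `Pipeline`, `add_text_length`, and `pseudonymize_users`, the record fields, and the fixed salt are all assumptions made for this example.

```python
import hashlib
from typing import Callable, Dict, List

# Hypothetical types for this sketch: a record is a plain dict, and a
# component is any function that maps a list of records to a new list.
Record = Dict[str, object]
Component = Callable[[List[Record]], List[Record]]


def add_text_length(records: List[Record]) -> List[Record]:
    """Toy enrichment step: attach one derived variable to every record."""
    return [{**r, "text_length": len(str(r["text"]))} for r in records]


def pseudonymize_users(records: List[Record]) -> List[Record]:
    """Replace the raw user id with a salted SHA-256 digest.

    The salt is hard-coded only to keep the example short; this is an
    assumption of the sketch, not a recommendation.
    """
    salt = "example-salt"
    masked_records = []
    for r in records:
        digest = hashlib.sha256((salt + str(r["user_id"])).encode("utf-8")).hexdigest()
        masked = {k: v for k, v in r.items() if k != "user_id"}
        masked["user_hash"] = digest
        masked_records.append(masked)
    return masked_records


class Pipeline:
    """Runs components in the order they were added."""

    def __init__(self) -> None:
        self.components: List[Component] = []

    def add(self, component: Component) -> "Pipeline":
        self.components.append(component)
        return self  # returning self allows chained .add() calls

    def run(self, records: List[Record]) -> List[Record]:
        for component in self.components:
            records = component(records)
        return records


# Made-up input records, used only to exercise the pipeline.
tweets = [
    {"user_id": 1, "text": "hello world"},
    {"user_id": 2, "text": "flow-based programming"},
    {"user_id": 1, "text": "another tweet"},
]

enriched = Pipeline().add(add_text_length).add(pseudonymize_users).run(tweets)
```

Each component sees only the records it receives and returns new ones, so adding a further step amounts to writing one more function and calling `add` again. In this sketch the raw `user_id` is dropped from the output, while tweets by the same user still share a `user_hash`.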