In an often-cited 2019 paper on the use of machine learning in political research, Anastasopoulos & Whitford (A&W) propose a text classification method for tweets related to organizational reputation. The aim of their paper was to provide a 'guide to practice' for public administration scholars and practitioners on the use of machine learning. In the current paper we follow up on that work with a replication of A&W's experiments and additional analyses of model stability and the effects of preprocessing, both in relation to the small size of the data set. We show that (1) the small data set causes the classification model to be highly sensitive to variations in the random train-test split, and that (2) the applied preprocessing causes the data to be extremely sparse, with the majority of items in the data having at most two non-zero lexical features. With additional experiments in which we vary the steps of the preprocessing pipeline, we show that the small data size continues to cause problems, irrespective of the preprocessing choices. Based on our findings, we argue that A&W's conclusions regarding the automated classification of organizational reputation tweets -- whether substantive or methodological -- cannot be maintained, and require a larger data set for training and more careful validation.
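The split-sensitivity finding can be illustrated with a minimal sketch. This is not A&W's actual pipeline (their model, features, and data are not reproduced here); it uses a hypothetical small, high-dimensional synthetic data set standing in for a small labelled tweet corpus, and a simple logistic regression classifier, to show how held-out accuracy can swing widely when only the random train-test split changes.

```python
# Illustrative sketch (assumption: synthetic data, logistic regression;
# NOT A&W's actual model or corpus). With few labelled items relative to
# the feature space, test accuracy varies notably across random splits.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical small, high-dimensional data set: 200 items, 300 features.
X, y = make_classification(n_samples=200, n_features=300,
                           n_informative=20, random_state=0)

scores = []
for seed in range(50):  # vary ONLY the random train-test split
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, random_state=seed)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    scores.append(accuracy_score(y_te, clf.predict(X_te)))

print(f"accuracy mean={np.mean(scores):.3f} "
      f"std={np.std(scores):.3f} "
      f"range=[{min(scores):.3f}, {max(scores):.3f}]")
```

The spread between the best and worst split is the quantity of interest: a single random split can over- or understate model quality, which is why the replication reports stability across many splits rather than one point estimate.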