Social scientists often classify text documents and use the resulting labels as an outcome or a predictor in empirical research. Automated text classification has become a standard tool because it reduces the amount of human coding required. However, scholars still need many human-labeled documents to train automated classifiers. To reduce labeling costs, we propose a new algorithm for text classification that combines a probabilistic model with active learning. The probabilistic model uses both labeled and unlabeled data, and active learning concentrates labeling efforts on the documents that are difficult to classify. Our validation study shows that the classification performance of our algorithm is comparable to that of state-of-the-art methods at a fraction of the computational cost. Moreover, we replicate two recently published articles and reach the same substantive conclusions using only a small proportion of the labeled data used in the original studies. We provide activeText, open-source software that implements our method.
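The general idea of combining a probabilistic model with active learning can be sketched in a few lines. The example below is an illustrative assumption rather than the authors' algorithm or the activeText API: it uses a multinomial naive Bayes mixture fitted by EM over labeled and unlabeled documents, and entropy-based uncertainty sampling to decide which documents to label next. All function names, the `unlabeled_weight` parameter, and the toy data are hypothetical.

```python
import math

def posterior(doc, log_prior, log_word):
    """P(class | doc) for a bag-of-words doc given as {word_id: count}."""
    scores = [lp + sum(c * lw[w] for w, c in doc.items())
              for lp, lw in zip(log_prior, log_word)]
    top = max(scores)  # log-sum-exp for numerical stability
    log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
    return [math.exp(s - log_norm) for s in scores]

def fit_em(docs, labels, n_classes, vocab_size, n_iter=20, unlabeled_weight=0.1):
    """EM for a naive Bayes mixture; `labels` maps doc index -> known class.
    Unlabeled docs are down-weighted (hypothetical choice for this sketch)."""
    resp = [[1.0 if labels[i] == k else 0.0 for k in range(n_classes)]
            if i in labels else [1.0 / n_classes] * n_classes
            for i in range(len(docs))]
    for _ in range(n_iter):
        # M-step: weighted counts with add-one smoothing
        class_mass = [1.0] * n_classes
        word_mass = [[1.0] * vocab_size for _ in range(n_classes)]
        for i, doc in enumerate(docs):
            weight = 1.0 if i in labels else unlabeled_weight
            for k in range(n_classes):
                class_mass[k] += weight * resp[i][k]
                for w, c in doc.items():
                    word_mass[k][w] += weight * resp[i][k] * c
        total = sum(class_mass)
        log_prior = [math.log(m / total) for m in class_mass]
        log_word = []
        for row in word_mass:
            row_total = sum(row)
            log_word.append([math.log(x / row_total) for x in row])
        # E-step: update class probabilities of unlabeled docs only
        for i, doc in enumerate(docs):
            if i not in labels:
                resp[i] = posterior(doc, log_prior, log_word)
    return log_prior, log_word

def entropy(p):
    return -sum(q * math.log(q) for q in p if q > 0)

def most_uncertain(docs, labels, log_prior, log_word, batch=1):
    """Indices of the unlabeled docs with the highest predictive entropy."""
    pool = [i for i in range(len(docs)) if i not in labels]
    pool.sort(key=lambda i: entropy(posterior(docs[i], log_prior, log_word)),
              reverse=True)
    return pool[:batch]

def active_learn(docs, oracle, seed_labels, n_classes, vocab_size,
                 rounds=2, batch=1):
    """Alternate between fitting the model and querying the human `oracle`."""
    labels = dict(seed_labels)
    for _ in range(rounds):
        log_prior, log_word = fit_em(docs, labels, n_classes, vocab_size)
        for i in most_uncertain(docs, labels, log_prior, log_word, batch):
            labels[i] = oracle(i)
    return labels, fit_em(docs, labels, n_classes, vocab_size)

# Toy corpus: words 0 and 1 lean toward class 0, words 2 and 3 toward class 1.
docs = [{0: 3, 1: 1}, {2: 2, 3: 2}, {0: 2}, {3: 3},
        {0: 1, 2: 1}, {1: 2, 0: 1}, {2: 3}]
truth = [0, 1, 0, 1, 0, 0, 1]  # stands in for the human coder
labels, (log_prior, log_word) = active_learn(
    docs, lambda i: truth[i], {0: 0, 1: 1}, n_classes=2, vocab_size=4)
predicted = [max(range(2), key=lambda k: posterior(d, log_prior, log_word)[k])
             for d in docs]
```

Starting from two seed labels, the loop requests one additional label per round, so the human coder labels four of the seven documents in total while the unlabeled documents still inform the word distributions through EM. Down-weighting the unlabeled documents is one common way to keep a small labeled set from being swamped by a large unlabeled pool; the value 0.1 here is arbitrary.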