The growing availability of data about online information behaviour opens up new possibilities for political communication research. However, the volume and variety of these data make them difficult to analyse and prompt the need for automated content analysis approaches relying on a broad range of natural language processing techniques (e.g. machine learning- or neural network-based ones). In this paper, we discuss how these techniques can be used to detect political content across different platforms. Using three validation datasets, which include a variety of political and non-political textual documents from online platforms, we systematically compare the performance of three groups of detection techniques relying on dictionaries, supervised machine learning, or neural networks. We also examine the impact of different modes of data preprocessing (e.g. stemming and stopword removal) on low-cost implementations of these techniques, using a large set (n = 66) of detection models. Our results show that preprocessing has a limited impact on model performance; the best results on less noisy data are achieved by neural network- and machine learning-based models, whereas dictionary-based models perform more robustly on noisy data.
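To make the compared technique families more concrete, the minimal sketch below contrasts a dictionary-based detector with a supervised machine learning baseline (TF-IDF features plus logistic regression), with stopword removal as one example preprocessing choice. The keyword list, example documents, and labels are hypothetical illustrations and do not reflect the paper's validation datasets or the actual models evaluated.

```python
# Illustrative sketch only: contrasts a dictionary-based political content
# detector with a supervised machine learning one. The keyword list, example
# texts, and labels are hypothetical and not taken from the paper's data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical dictionary of political terms.
POLITICAL_TERMS = {"election", "parliament", "minister", "policy", "vote"}

def dictionary_detector(text: str) -> bool:
    """Flag a document as political if it contains any dictionary term."""
    tokens = text.lower().split()
    return any(tok.strip(".,!?") in POLITICAL_TERMS for tok in tokens)

# Hypothetical labelled documents (1 = political, 0 = non-political).
texts = [
    "The minister announced a new election policy.",
    "Parliament will vote on the budget next week.",
    "This recipe for pasta is quick and delicious.",
    "Our team won the football match yesterday.",
]
labels = [1, 1, 0, 0]

# Supervised baseline: TF-IDF features (with stopword removal as one
# preprocessing option) feeding a logistic regression classifier.
ml_detector = make_pipeline(
    TfidfVectorizer(lowercase=True, stop_words="english"),
    LogisticRegression(max_iter=1000),
)
ml_detector.fit(texts, labels)

if __name__ == "__main__":
    sample = "Voters head to the polls for the parliament election."
    print("dictionary-based:", dictionary_detector(sample))
    print("machine learning:", ml_detector.predict([sample])[0])
```

In a comparison like the one described above, each technique family and preprocessing configuration would be evaluated against held-out validation data; the sketch only shows how the two approaches differ in structure (lexicon lookup versus learned classifier).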