Panning for gold: Lessons learned from the platform-agnostic automated detection of political content in textual data

The growing availability of data about online information behaviour enables new possibilities for political communication research. However, the volume and variety of these data make them difficult to analyse and prompt the need for automated content analysis approaches that rely on a broad range of natural language processing techniques (e.g. machine learning- or neural network-based ones). In this paper, we discuss how these techniques can be used to detect political content across different platforms. Using three validation datasets, which include a variety of political and non-political textual documents from online platforms, we systematically compare the performance of three groups of detection techniques relying on dictionaries, supervised machine learning, or neural networks. We also examine the impact of different modes of data preprocessing (e.g. stemming and stopword removal) on the low-cost implementations of these techniques using a large set (n = 66) of detection models. Our results show that preprocessing has a limited impact on model performance. The best results on less noisy data are achieved by neural network- and machine-learning-based models, whereas dictionary-based models perform more robustly on noisy data.
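To make the dictionary-based group of techniques concrete, here is a minimal sketch of what such a detector and its evaluation could look like. It is an illustration only and not the authors' implementation: the word list `POLITICAL_TERMS`, the stopword list, the `threshold` parameter, and the four toy documents are all hypothetical and chosen just to show the mechanics.

```python
import re

# Hypothetical toy dictionary and stopword list (assumptions, not from the paper).
POLITICAL_TERMS = {"election", "parliament", "minister", "vote", "party", "policy"}
STOPWORDS = {"the", "a", "an", "of", "in", "on", "for", "to", "and", "is", "at", "my"}


def tokenize(text, remove_stopwords=False):
    """Lowercase the text, keep alphabetic tokens, optionally drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return tokens


def is_political(text, threshold=1, remove_stopwords=False):
    """Flag a document when it contains at least `threshold` dictionary terms."""
    tokens = tokenize(text, remove_stopwords)
    hits = sum(1 for t in tokens if t in POLITICAL_TERMS)
    return hits >= threshold


def f1_score(y_true, y_pred):
    """F1 for the positive (political) class; 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labelled documents (hypothetical); True means political.
docs = [
    ("The minister announced a new policy before the election", True),
    ("Best recipe for a chocolate cake", False),
    ("Parliament will vote on the budget", True),
    ("Party tonight at my place", False),  # ambiguous word: "party"
]
labels = [label for _, label in docs]

preds_loose = [is_political(text, threshold=1) for text, _ in docs]
preds_strict = [is_political(text, threshold=2) for text, _ in docs]

print("F1, threshold=1:", round(f1_score(labels, preds_loose), 2))
print("F1, threshold=2:", round(f1_score(labels, preds_strict), 2))
```

In this toy setting the ambiguous word "party" produces a false positive at `threshold=1`, and requiring two dictionary hits removes it. Stopword removal does not change the predictions here, because the detector only counts dictionary terms; that is a property of this sketch, not a finding about the models compared in the paper.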

