Simplistic Collection and Labeling Practices Limit the Utility of Benchmark Datasets for Twitter Bot Detection

Accurate bot detection is necessary for the safety and integrity of online platforms. It is also crucial for research on the influence of bots in elections, the spread of misinformation, and financial market manipulation. Platforms deploy infrastructure to flag or remove automated accounts, but their tools and data are not publicly available. Thus, the public must rely on third-party bot detection. These tools employ machine learning and often achieve near-perfect performance for classification on existing datasets, suggesting bot detection is accurate, reliable and fit for use in downstream applications. We provide evidence that this is not the case and show that high performance is attributable to limitations in dataset collection and labeling rather than sophistication of the tools. Specifically, we show that simple decision rules -- shallow decision trees trained on a small number of features -- achieve near-state-of-the-art performance on most available datasets and that bot detection datasets, even when combined together, do not generalize well to out-of-sample datasets. Our findings reveal that predictions are highly dependent on each dataset's collection and labeling procedures rather than fundamental differences between bots and humans. These results have important implications for both transparency in sampling and labeling procedures and potential biases in research using existing bot detection tools for pre-processing.
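To make the abstract's central claim concrete, here is a minimal sketch of a depth-1 decision tree (a single threshold on a single feature) that scores perfectly on one dataset and close to chance on another. Everything in it is an illustrative assumption rather than the authors' code or data: the two datasets are synthetic toys, the feature names `tweets_per_day` and `follower_count` are hypothetical, and the "collection quirks" are built in by hand so that each toy dataset separates bots from humans along a different feature.

```python
import random

def predict_one(stump, row):
    """Apply a depth-1 tree: compare one feature against one threshold."""
    feature, threshold, polarity = stump
    is_above = row[feature] > threshold
    return int(is_above) if polarity == 1 else int(not is_above)

def accuracy(stump, X, y):
    return sum(predict_one(stump, r) == label for r, label in zip(X, y)) / len(y)

def fit_stump(X, y):
    """Exhaustively pick the (feature, threshold, polarity) with the best training accuracy."""
    best_acc, best_stump = -1.0, None
    for feature in range(len(X[0])):
        for threshold in sorted(set(row[feature] for row in X)):
            for polarity in (1, -1):
                stump = (feature, threshold, polarity)
                acc = accuracy(stump, X, y)
                if acc > best_acc:
                    best_acc, best_stump = acc, stump
    return best_stump

def make_dataset(rng, n_per_class, bot_ranges, human_ranges):
    """Synthetic rows of [tweets_per_day, follower_count]; label 1 = bot, 0 = human."""
    X, y = [], []
    for label, ranges in ((1, bot_ranges), (0, human_ranges)):
        for _ in range(n_per_class):
            X.append([rng.uniform(lo, hi) for lo, hi in ranges])
            y.append(label)
    return X, y

rng = random.Random(0)

# Toy dataset A (assumption): bots were collected from high-volume posters,
# so tweets_per_day alone separates the classes.
X_a, y_a = make_dataset(rng, 100,
                        bot_ranges=[(50, 100), (0, 1000)],
                        human_ranges=[(0, 20), (0, 1000)])

# Toy dataset B (assumption): bots were collected from low-follower accounts,
# and tweets_per_day carries no signal at all.
X_b, y_b = make_dataset(rng, 100,
                        bot_ranges=[(0, 100), (0, 100)],
                        human_ranges=[(0, 100), (500, 1000)])

stump = fit_stump(X_a, y_a)
acc_in = accuracy(stump, X_a, y_a)    # evaluated on the dataset it was fit to
acc_out = accuracy(stump, X_b, y_b)   # evaluated on the other dataset

print("learned rule (feature index, threshold, polarity):", stump)
print("accuracy on dataset A:", acc_in)
print("accuracy on dataset B:", acc_out)
```

The rule fitted on dataset A encodes how that dataset was assembled, not a general difference between bots and humans, which is why it transfers poorly to dataset B. A study on real data would use held-out splits and an established tree implementation; the exhaustive single-split search here is only meant to keep the sketch dependency-free.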

