Accurate bot detection is necessary for the safety and integrity of online platforms. It is also crucial for research on the influence of bots in elections, the spread of misinformation, and financial market manipulation. Platforms deploy infrastructure to flag or remove automated accounts, but their tools and data are not publicly available. Thus, the public must rely on third-party bot detection. These tools employ machine learning and often achieve near-perfect classification performance on existing datasets, suggesting that bot detection is accurate, reliable, and fit for use in downstream applications. We provide evidence that this is not the case and show that high performance is attributable to limitations in dataset collection and labeling rather than to the sophistication of the tools. Specifically, we show that simple decision rules (shallow decision trees trained on a small number of features) achieve near-state-of-the-art performance on most available datasets, and that bot detection datasets, even when combined, do not generalize well to out-of-sample datasets. Our findings reveal that predictions depend heavily on each dataset's collection and labeling procedures rather than on fundamental differences between bots and humans. These results have important implications both for transparency in sampling and labeling procedures and for potential biases in research that uses existing bot detection tools for pre-processing.
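The two claims above (a simple rule scores well within a dataset, yet fails to transfer to another) can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the feature names (`followers`, `tweets_per_day`), the synthetic data generator, and the depth-1 tree (a decision stump) standing in for the shallow trees the abstract mentions. It uses none of the paper's datasets and does not reproduce its results.

```python
import random

def make_dataset(rng, n_per_class, bot_followers, human_followers):
    """Hypothetical generator: rows are [followers, tweets_per_day]; label 1 = bot."""
    rows, labels = [], []
    for label, (lo, hi) in ((1, bot_followers), (0, human_followers)):
        for _ in range(n_per_class):
            followers = rng.randint(lo, hi)
            tweets_per_day = rng.randint(0, 50)  # uninformative in every dataset
            rows.append([followers, tweets_per_day])
            labels.append(label)
    return rows, labels

def predict(stump, rows):
    """Apply a one-split rule: `below` if feature <= threshold, else the other label."""
    feature, threshold, below = stump
    return [below if r[feature] <= threshold else 1 - below for r in rows]

def accuracy(stump, rows, labels):
    preds = predict(stump, rows)
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def fit_stump(rows, labels):
    """Exhaustively search for the single split with the best training accuracy."""
    best_acc, best_stump = -1.0, None
    for feature in range(len(rows[0])):
        for threshold in sorted({r[feature] for r in rows}):
            for below in (0, 1):
                candidate = (feature, threshold, below)
                acc = accuracy(candidate, rows, labels)
                if acc > best_acc:
                    best_acc, best_stump = acc, candidate
    return best_stump

rng = random.Random(0)

# Dataset A (assumed): bots were collected in a way that left them with few followers.
train_a = make_dataset(rng, 200, bot_followers=(0, 50), human_followers=(100, 1000))
test_a = make_dataset(rng, 200, bot_followers=(0, 50), human_followers=(100, 1000))

# Dataset B (assumed): a different collection procedure; bots look like humans here.
test_b = make_dataset(rng, 200, bot_followers=(100, 1000), human_followers=(100, 1000))

stump = fit_stump(*train_a)
acc_in = accuracy(stump, *test_a)   # held-out data from the same collection procedure
acc_out = accuracy(stump, *test_b)  # data from a different collection procedure

print("learned rule (feature index, threshold, label if below):", stump)
print("accuracy on held-out A:", acc_in)
print("accuracy on B:", acc_out)
```

In this toy setup the stump keys on the follower count, which separates the classes only because of how dataset A was assumed to be collected. On dataset B the same rule labels every account as human, so it does no better than chance on the balanced sample.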