Twitter bot detection has become an increasingly important task to combat misinformation, facilitate social media moderation, and preserve the integrity of online discourse. State-of-the-art bot detection methods generally leverage the graph structure of the Twitter network, and they exhibit promising performance when confronting novel Twitter bots that traditional methods fail to detect. However, very few of the existing Twitter bot detection datasets are graph-based, and even these few graph-based datasets suffer from limited dataset scale, incomplete graph structure, and low annotation quality. The lack of a large-scale graph-based Twitter bot detection benchmark that addresses these issues has seriously hindered the development and evaluation of novel graph-based bot detection approaches. In this paper, we propose TwiBot-22, a comprehensive graph-based Twitter bot detection benchmark that presents the largest dataset to date, provides diversified entities and relations on the Twitter network, and has considerably better annotation quality than existing datasets. In addition, we re-implement 35 representative Twitter bot detection baselines and evaluate them on 9 datasets, including TwiBot-22, to promote a fair comparison of model performance and a holistic understanding of research progress. To facilitate further research, we consolidate all implemented code and datasets into the TwiBot-22 evaluation framework, where researchers can consistently evaluate new models and datasets. The TwiBot-22 Twitter bot detection benchmark and evaluation framework are publicly available at https://twibot22.github.io/