A ground-truth dataset and classification model for detecting bots in GitHub issue and PR comments

Bots are frequently used in GitHub repositories to automate repetitive activities that are part of the distributed software development process. They communicate with human actors through comments. Although detecting their presence is important for many reasons, no large and representative ground-truth dataset is available, nor are classification models that detect and validate bots on the basis of such a dataset. This paper proposes a ground-truth dataset, based on a manual analysis with high inter-rater agreement, of pull request and issue comments from 5,000 distinct GitHub accounts, of which 527 have been identified as bots. Using this dataset, we propose an automated classification model to detect bots. Its main features are the number of empty and non-empty comments of each account, the number of comment patterns, and the inequality between comments within comment patterns. We obtained a weighted average precision, recall and F1-score of 0.98 on a test set containing 40% of the data. We integrated the classification model into an open-source command-line tool that allows practitioners to detect which accounts in a given GitHub repository actually correspond to bots.
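To make the feature set more concrete, the following is a minimal sketch of how such per-account features could be computed. It is illustrative only: the abstract does not specify how comment patterns are detected or how inequality is measured, so the `normalize` helper (digit masking as a stand-in for pattern grouping), the use of a Gini coefficient over pattern sizes, and all function and key names are assumptions rather than the method of the paper.

```python
from __future__ import annotations

import re
from collections import Counter


def normalize(comment: str) -> str:
    """Hypothetical pattern key: lowercase, mask digits, collapse whitespace.

    This is an assumed stand-in for pattern detection, not the paper's method.
    """
    text = re.sub(r"\d+", "<n>", comment.lower())
    return re.sub(r"\s+", " ", text).strip()


def gini(counts: list[int]) -> float:
    """Gini coefficient of non-negative counts (0.0 means perfectly equal)."""
    n = len(counts)
    total = sum(counts)
    if n == 0 or total == 0:
        return 0.0
    ordered = sorted(counts)
    weighted = sum((i + 1) * c for i, c in enumerate(ordered))
    return (2 * weighted) / (n * total) - (n + 1) / n


def account_features(comments: list[str]) -> dict[str, float]:
    """Compute the four illustrative features for one account's comments."""
    non_empty = [c for c in comments if c.strip()]
    # Group non-empty comments by their normalized form (assumed pattern key).
    patterns = Counter(normalize(c) for c in non_empty)
    return {
        "empty_comments": len(comments) - len(non_empty),
        "non_empty_comments": len(non_empty),
        "comment_patterns": len(patterns),
        # Assumption: inequality approximated by Gini over pattern sizes.
        "pattern_inequality": gini(list(patterns.values())),
    }


if __name__ == "__main__":
    bot_like = ["Build #12 passed", "Build #13 passed", "Build #14 passed", ""]
    print(account_features(bot_like))
```

Under these assumptions, an account that posts many near-identical comments collapses into very few patterns relative to its comment count, which is the kind of signal the features above are meant to capture. A classifier would then be trained on such feature vectors for the labelled accounts.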
