Detecting bots in distributed software development activity is important for preventing bias in large-scale socio-technical empirical analyses. In previous work, we proposed a classification model that detects bots in GitHub repositories based on the pull request and issue comments of GitHub accounts. The current study generalises that approach to git contributors, using their commit messages instead. We train and evaluate the classification model on a dataset of 6,922 git contributors. On this dataset, the original model based on pull request and issue comments obtained a precision of 0.77; retraining the model on git commit messages increased the precision to 0.80. As a proof of concept, we implemented this model in BoDeGiC, an open-source command-line tool for detecting bots in git repositories.