Given the democratic nature of open source development, code review and issue discussions may be uncivil. Incivility, defined as features of discussion that convey an unnecessarily disrespectful tone, can have negative consequences for open source communities. To prevent or minimize these consequences, open source platforms have introduced mechanisms for removing uncivil language from discussions. However, such approaches require manual inspection, which can be overwhelming given the large number of discussions. To help open source communities deal with this problem, in this paper we compare six classical machine learning models with BERT for detecting incivility in open source code review and issue discussions. Furthermore, we assess whether adding contextual information improves the models' performance and how well the models perform in a cross-platform setting. We found that BERT outperforms the classical machine learning models, achieving a best F1-score of 0.95, and that the classical models tend to underperform in detecting non-technical and civil discussions. Our results also show that adding contextual information to BERT did not improve its performance and that none of the analyzed classifiers performed outstandingly in a cross-platform setting. Finally, we provide insights into the tones that the classifiers misclassify.
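To make the experimental setup concrete, the following is a minimal sketch of one of the classical baselines the paper compares against BERT: a TF-IDF representation fed to a logistic-regression classifier, evaluated with the F1-score. This is not the paper's actual pipeline; the toy sentences, labels, features, and hyperparameters below are illustrative assumptions.

```python
# Hypothetical classical-ML baseline for incivility detection
# (1 = uncivil, 0 = civil). All data and settings are illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy stand-ins for labeled code review / issue discussion sentences.
texts = [
    "This patch is garbage, did you even test it?",
    "Thanks, the refactoring looks clean to me.",
    "Stop wasting everyone's time with broken commits.",
    "Could you add a unit test for the edge case?",
] * 25  # repeated only so the toy train/test split has enough samples
labels = [1, 0, 1, 0] * 25

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

# TF-IDF unigrams/bigrams + logistic regression, one of many
# possible classical-model configurations.
clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
clf.fit(X_train, y_train)
print("F1:", f1_score(y_test, clf.predict(X_test)))
```

In the same spirit, the BERT classifier would replace the TF-IDF features with a fine-tuned transformer encoder; the comparison in the paper reports the resulting F1-scores on held-out code review and issue discussions.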