A Benchmark Study of the Contemporary Toxicity Detectors on Software Engineering Interactions

Automated filtering of toxic conversations may help an open-source software (OSS) community maintain healthy interactions among project participants. Although several general-purpose tools exist to identify toxic content, they may incorrectly flag words commonly used in the Software Engineering (SE) context as toxic (e.g., 'junk', 'kill', and 'dump'), and vice versa. To counter this challenge, the CMU Strudel Lab has proposed an SE-specific tool (referred to as `STRUDEL' hereinafter) that combines the output of the Perspective API with the output of a customized version of Stanford's Politeness detector. However, since STRUDEL was evaluated on only 654 SE texts, its practical applicability is unclear. Therefore, this study aims to empirically evaluate the STRUDEL tool as well as four state-of-the-art general-purpose toxicity detectors on a large-scale SE dataset. Toward this goal, we empirically developed a rubric to manually label toxic SE interactions. Using this rubric, we manually labeled a dataset of 6,533 code review comments and 4,140 Gitter messages. The results of our analyses suggest significant degradation of all tools' performance on our datasets. The degradation was significantly higher on our dataset of formal SE communication, such as code review, than on our dataset of informal communication, such as Gitter messages. Two of the models from our study showed significant performance improvements during 10-fold cross-validation after we retrained them on our SE datasets. Based on our manual investigation of the incorrectly classified texts, we have identified several recommendations for developing an SE-specific toxicity detector.
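The false-positive problem described above can be illustrated with a minimal sketch. It assumes a naive keyword-based detector as a stand-in for a general-purpose tool; it is not the logic of STRUDEL, the Perspective API, or any other detector evaluated in the study. The word list, the function names, and the five sample texts are all hypothetical and invented for illustration; none come from the study's datasets.

```python
from typing import Callable, List, Tuple

# Hypothetical word list standing in for a general-purpose detector.
# It deliberately includes SE-neutral words ('junk', 'kill', 'dump').
FLAGGED_WORDS = {"junk", "kill", "dump", "idiot", "stupid"}

# Invented examples: (text, manually assigned "is toxic" label).
SAMPLES: List[Tuple[str, bool]] = [
    ("Please kill the child process before the dump is written.", False),
    ("This removes junk characters from the log output.", False),
    ("Only an idiot would write a patch like this.", True),
    ("Looks good to me, thanks for the quick fix.", False),
    ("Nobody wants your garbage contributions here.", True),
]


def naive_detector(text: str) -> bool:
    """Flag a text as toxic if any token appears in FLAGGED_WORDS."""
    tokens = {tok.strip(".,!?'\"").lower() for tok in text.split()}
    return bool(tokens & FLAGGED_WORDS)


def precision_recall_f1(
    samples: List[Tuple[str, bool]],
    detector: Callable[[str], bool],
) -> Tuple[float, float, float]:
    """Compute precision, recall, and F1 for the toxic class."""
    tp = fp = fn = 0
    for text, is_toxic in samples:
        predicted = detector(text)
        if predicted and is_toxic:
            tp += 1
        elif predicted and not is_toxic:
            fp += 1  # e.g. 'kill the child process' flagged as toxic
        elif not predicted and is_toxic:
            fn += 1  # toxic text that contains no flagged word
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return precision, recall, f1


if __name__ == "__main__":
    p, r, f = precision_recall_f1(SAMPLES, naive_detector)
    print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
    # prints: precision=0.33 recall=0.50 f1=0.40
```

On this toy sample, two of the three flagged texts are ordinary technical statements, and one toxic text is missed because it contains no flagged word, which mirrors the "and vice versa" case in the abstract. The numbers say nothing about the tools in the study; they only show how such errors surface in precision and recall.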
