Stop Words for Processing Software Engineering Documents: Do they Matter?

Stop words, which are considered non-predictive, are often eliminated in natural language processing tasks. However, the definition of uninformative vocabulary is vague, so most algorithms use general knowledge-based stop lists to remove stop words. There is an ongoing debate among academics about the usefulness of stop word elimination, especially in domain-specific settings. In this work, we investigate the usefulness of stop word removal in a software engineering context. To do this, we replicate and experiment with three software engineering research tools from related work. Additionally, we construct a corpus of software engineering domain-related text from 10,000 Stack Overflow questions and identify 200 domain-specific stop words using traditional information-theoretic methods. Our results show that the use of domain-specific stop words significantly improved the performance of research tools compared to the use of a general stop list and that 17 out of 19 evaluation measures showed better performance.

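The abstract does not say which information-theoretic measure was used to pick the 200 domain-specific stop words. As a rough illustration only, the sketch below ranks terms by the entropy of their distribution across documents: a term spread evenly over many documents says little about any one of them, so it is a plausible stop-word candidate. The function names, the tokenizer, and the toy corpus are all hypothetical, and this criterion is an assumption rather than the authors' procedure.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on runs of letters (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def rank_stop_word_candidates(documents, top_k=5):
    """Rank terms by the entropy of their distribution across documents.

    High entropy means the term's occurrences are spread evenly over many
    documents. This is an illustrative criterion, not necessarily the one
    used in the paper.
    """
    per_term_counts = defaultdict(Counter)  # term -> {document index: count}
    for doc_id, text in enumerate(documents):
        for token in tokenize(text):
            per_term_counts[token][doc_id] += 1

    scores = {}
    for term, doc_counts in per_term_counts.items():
        total = sum(doc_counts.values())
        # Shannon entropy (in bits) of the term's per-document distribution.
        scores[term] = sum(
            (c / total) * math.log2(total / c) for c in doc_counts.values()
        )

    # Highest entropy first; ties are broken alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_k]


if __name__ == "__main__":
    # Hypothetical toy corpus standing in for Stack Overflow questions.
    corpus = [
        "How do I fix this error in my code when parsing JSON",
        "My code throws an error when I sort a list",
        "Error in my code while connecting to the database",
        "Why does my code give an error with async functions",
    ]
    for term, score in rank_stop_word_candidates(corpus):
        print(f"{term}\t{score:.3f}")
```

On this toy corpus, "code", "error", and "my" occur once in every question and therefore rank first, which mirrors the intuition that words common to nearly all software engineering questions carry little signal. A real pipeline would run over the full corpus and would likely combine several measures before fixing a cutoff.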