Technical debt refers to taking shortcuts to achieve short-term goals at the expense of the long-term maintainability and evolvability of software systems. A large part of technical debt is explicitly reported by developers themselves; this is commonly referred to as Self-Admitted Technical Debt (SATD). Previous work has focused on identifying SATD in source code comments and issue trackers. However, no approaches are available for automatically identifying SATD in other sources, such as commit messages and pull requests, or by combining multiple sources. We therefore propose and evaluate an approach for automated SATD identification that integrates four sources: source code comments, commit messages, pull requests, and issue tracking systems. Our approach outperforms baseline approaches and achieves an average F1-score of 0.611 when detecting four types of SATD (code/design debt, requirement debt, documentation debt, and test debt) in these four sources. We then analyze 23.6M code comments, 1.3M commit messages, 3.7M issue sections, and 1.7M pull request sections to characterize SATD in 103 open-source projects. Furthermore, we investigate SATD keywords and the relations between SATD in different sources. The findings indicate, among other things, that: 1) SATD is evenly spread among all sources; 2) issues and pull requests are the two most similar sources in terms of the number of shared SATD keywords, followed by commit messages and then code comments; 3) there are four kinds of relations between SATD items in the different sources.