Most Self-Admitted Technical Debt (SATD) research relies on explicit SATD features such as 'TODO' and 'FIXME' for SATD detection. A closer look reveals that several SATD studies use simple ('Easy to Find') SATD code comments without contextual data (the preceding and succeeding source code context). This work addresses that gap with the PENTACET (or 5C) dataset. PENTACET is a large dataset of Curated Contextual Code Comments per Contributor and the most extensive SATD dataset. We mine 9,096 Open Source Software Java projects totaling 435 million LOC. The outcome is a dataset with 23 million code comments, the preceding and succeeding source code context for each comment, and more than 500,000 comments labeled as SATD, including both 'Easy to Find' and 'Hard to Find' SATD. We believe PENTACET will further SATD research that uses Artificial Intelligence techniques.