PENTACET data -- 23 Million Contextual Code Comments and 500,000 SATD comments

March 24, 2023

Murali Sridharan, Leevi Rantala, Mika Mäntylä

From arXiv; accepted at the MSR 2023 Tools and Data Showcase

Most Self-Admitted Technical Debt (SATD) research relies on explicit SATD features such as 'TODO' and 'FIXME' for SATD detection. A closer look reveals that several SATD studies use simple ('Easy to Find') SATD code comments without contextual data (the preceding and succeeding source code). This work addresses that gap with the PENTACET (or 5C) dataset. PENTACET is a large dataset of Curated Contextual Code Comments per Contributor and the most extensive SATD dataset. We mine 9,096 open-source Java projects totaling 435 million LOC. The outcome is a dataset with 23 million code comments, the preceding and succeeding source code context for each comment, and more than 500,000 comments labeled as SATD, including both 'Easy to Find' and 'Hard to Find' SATD. We believe PENTACET will further SATD research that uses Artificial Intelligence techniques.
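
To make the idea of a contextual code comment concrete, the following is a minimal sketch and not the authors' mining pipeline. It assumes single-line `//` comments only, a fixed context window of two lines, and a two-keyword marker list; the names `CommentRecord`, `extract_comments`, and `JAVA_SNIPPET` are hypothetical. It pairs each comment with its preceding and succeeding source lines and flags comments that carry an explicit marker ('Easy to Find' SATD).

```python
import re
from dataclasses import dataclass

# Explicit SATD markers named in the abstract; this short list is an assumption.
EASY_SATD = re.compile(r"\b(TODO|FIXME)\b")

# Hypothetical Java input used only for illustration.
JAVA_SNIPPET = """\
int total = 0;
for (int v : values) {
    // TODO: handle overflow
    total += v;
}
// this works around a broken cache, remove later
return total;
"""


@dataclass
class CommentRecord:
    line_no: int           # 1-based line number of the comment
    text: str              # comment text without the leading "//"
    preceding: list[str]   # up to `window` source lines before the comment
    succeeding: list[str]  # up to `window` source lines after the comment
    easy_satd: bool        # True if an explicit marker such as TODO is present


def extract_comments(source: str, window: int = 2) -> list[CommentRecord]:
    """Collect `//` comments together with their surrounding source lines.

    Simplification: the first "//" on a line is treated as a comment start,
    so block comments are ignored and "//" inside string literals is misread.
    """
    lines = source.splitlines()
    records = []
    for i, line in enumerate(lines):
        idx = line.find("//")
        if idx == -1:
            continue
        text = line[idx + 2:].strip()
        records.append(CommentRecord(
            line_no=i + 1,
            text=text,
            preceding=lines[max(0, i - window):i],
            succeeding=lines[i + 1:i + 1 + window],
            easy_satd=bool(EASY_SATD.search(text)),
        ))
    return records


if __name__ == "__main__":
    for record in extract_comments(JAVA_SNIPPET):
        print(record.line_no, record.easy_satd, record.text)
```

In this snippet the second comment describes a workaround without any explicit marker, so a keyword check alone does not flag it. Comments of that kind are what the abstract calls 'Hard to Find' SATD, and the surrounding source lines are the context a learned classifier could draw on.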
