Code summarization, the task of generating useful comments for a given piece of code, has long been of interest. Most existing code summarization models are trained and validated on widely used code-comment benchmark datasets. However, little is known about the quality of the benchmark datasets built from real-world projects. Are the benchmark datasets as good as expected? To bridge this gap, we conduct a systematic study to assess and improve the quality of four benchmark datasets widely used for code summarization tasks. First, we propose an automated code-comment cleaning tool that can accurately detect noisy data caused by inappropriate data preprocessing operations in existing benchmark datasets. Then, we apply the tool to further assess the data quality of the four benchmark datasets based on the detected noise. Finally, we conduct comparative experiments to investigate the impact of noisy data on the performance of code summarization models. The results show that such data preprocessing noise exists widely in all four benchmark datasets, and that removing the noisy data leads to a significant improvement in code summarization performance. We believe these findings and insights will enable a better understanding of data quality in code summarization tasks and pave the way for relevant research and practice.
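To make the idea of an automated code-comment cleaning tool concrete, the sketch below shows a minimal heuristic filter for noisy code-comment pairs. The specific rules here (too-short comments, link-only comments, auto-generated boilerplate, non-alphabetic comments) are illustrative assumptions and do not reproduce the actual rule set of the tool described above.

```python
import re

def is_noisy(code: str, comment: str) -> bool:
    """Hypothetical heuristics flagging a code-comment pair as noisy.

    These rules are an illustrative sketch, not the paper's actual tool.
    """
    comment = comment.strip()
    if len(comment.split()) < 3:                      # too short to be informative
        return True
    if re.search(r"https?://", comment):              # comment is only a link/reference
        return True
    if re.match(r"(?i)auto[- ]?generated|created by", comment):
        return True                                   # IDE/tool boilerplate
    if not re.search(r"[A-Za-z]", comment):           # no natural-language content
        return True
    return False

# Example: filter a small set of code-comment pairs.
pairs = [
    ("def add(a, b): return a + b", "Add two numbers and return the sum."),
    ("def f(x): pass", "TODO"),
    ("class A: pass", "Auto-generated by the IDE"),
]
clean = [(c, m) for c, m in pairs if not is_noisy(c, m)]
```

In this sketch, only the first pair survives filtering; the other two are dropped as uninformative or auto-generated. A real cleaning tool would combine many such rules and validate them against manually labeled samples.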