Translation between natural language and source code can help software development by enabling developers to comprehend, ideate, search, and write computer programs in natural language. Despite growing interest from the industry and the research community, this task is often difficult due to the lack of large standard datasets suitable for training deep neural models, standard noise removal methods, and evaluation benchmarks. This leaves researchers to collect new small-scale datasets, resulting in inconsistencies across published works. In this study, we present CoDesc -- a large parallel dataset composed of 4.2 million Java methods and natural language descriptions. With extensive analysis, we identify and remove prevailing noise patterns from the dataset. We demonstrate the proficiency of CoDesc in two complementary tasks for code-description pairs: code summarization and code search. We show that the dataset helps improve code search by up to 22\% and achieves the new state-of-the-art in code summarization. Furthermore, we show CoDesc's effectiveness in pre-training--fine-tuning setup, opening possibilities in building pretrained language models for Java. To facilitate future research, we release the dataset, a data processing tool, and a benchmark at \url{https://github.com/csebuetnlp/CoDesc}.