Translation between natural language and source code can help software development by enabling developers to comprehend, ideate, search, and write computer programs in natural language. Despite growing interest from the industry and the research community, this task is often difficult due to the lack of large standard datasets suitable for training deep neural models, standard noise removal methods, and evaluation benchmarks. This leaves researchers to collect new small-scale datasets, resulting in inconsistencies across published works. In this study, we present CoDesc -- a large parallel dataset composed of 4.2 million Java methods and natural language descriptions. With extensive analysis, we identify and remove prevailing noise patterns from the dataset. We demonstrate the proficiency of CoDesc in two complementary tasks for code-description pairs: code summarization and code search. We show that the dataset helps improve code search by up to 22\% and achieves the new state-of-the-art in code summarization. Furthermore, we show CoDesc's effectiveness in pre-training--fine-tuning setup, opening possibilities in building pretrained language models for Java. To facilitate future research, we release the dataset, a data processing tool, and a benchmark at \url{https://github.com/csebuetnlp/CoDesc}.