Recent advances in machine learning have significantly improved the understanding of source code and achieved good performance on a number of downstream tasks. Open-source repositories such as GitHub enable this progress by providing rich unlabeled code data. However, the lack of high-quality labeled data has largely hindered progress on several code-related tasks, such as program translation, summarization, synthesis, and code search. This paper introduces XLCoST (Cross-Lingual Code SnippeT dataset), a new benchmark dataset for cross-lingual code intelligence. Our dataset contains fine-grained parallel data from 8 languages (7 commonly used programming languages and English) and supports 10 cross-lingual code tasks. To the best of our knowledge, it is the largest parallel dataset for source code in terms of both size and number of languages. We also report the performance of several state-of-the-art baseline models on each task. We believe this new dataset can be a valuable asset for the research community and can facilitate the development and validation of new methods for cross-lingual code intelligence.