We propose CodeQA, a free-form question answering dataset for source code comprehension: given a code snippet and a question, the task is to generate a textual answer. CodeQA contains a Java dataset with 119,778 question-answer pairs and a Python dataset with 70,085 question-answer pairs. To obtain natural and faithful questions and answers, we implement syntactic rules and semantic analysis to transform code comments into question-answer pairs. We present the construction process and conduct a systematic analysis of our dataset. We report and discuss experimental results achieved by several neural baselines on our dataset. While research on question answering and machine reading comprehension is developing rapidly, little prior work has addressed code question answering. This new dataset can serve as a useful research benchmark for source code comprehension.
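To make the idea of turning a code comment into a question-answer pair concrete, the following is a minimal sketch of a single syntactic rule. It is an assumption made for illustration only: the regular expression, the function name `comment_to_qa`, and the fixed question template are hypothetical and are not taken from the CodeQA construction pipeline, which combines many rules with semantic analysis.

```python
import re
from typing import Optional, Tuple

# Hypothetical rule (illustration only, not the CodeQA rule set):
# match comments of the form "Return(s) <something>."
_RETURNS = re.compile(r"^\s*returns?\s+(?P<answer>.+?)\s*\.?\s*$", re.IGNORECASE)


def comment_to_qa(comment: str) -> Optional[Tuple[str, str]]:
    """Turn a 'Returns X.' style comment into a (question, answer) pair.

    Returns None when this toy rule does not apply to the comment.
    """
    match = _RETURNS.match(comment)
    if match is None:
        return None
    # The question is a fixed template; the answer is the rest of the comment.
    return ("What does the function return?", match.group("answer"))


if __name__ == "__main__":
    print(comment_to_qa("Returns the number of items in the cart."))
    print(comment_to_qa("Sort the list in place."))
```

A comment that does not fit the pattern yields `None`, which reflects why a single rule is insufficient and a real pipeline would need a broader rule set plus filtering.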