The performance of automatic code documentation generation models depends critically on the quality of the training data used for supervision. However, most existing code documentation datasets are constructed through large-scale scraping of public repositories with limited quality control. As a result, they often contain noisy documentation, extensive duplication, and increasing contamination from AI-generated content. These issues weaken the supervision signal available to learning-based models and complicate evaluation. We introduce Code2Doc, a quality-first curated dataset for function-level code documentation generation. Code2Doc consists of 13,358 high-quality function-documentation pairs extracted from widely used open-source repositories spanning five programming languages: Python, Java, TypeScript, JavaScript, and C++. The dataset is constructed using a four-stage curation pipeline that enforces documentation completeness and clarity, filters functions on structural and complexity criteria, removes exact and near-duplicate code, and flags documentation likely to be AI-generated. Of the 52,069 extracted candidates, only 25.6% satisfy all quality constraints. We provide a detailed analysis of the resulting dataset, which achieves a mean documentation quality score of 6.93 out of 10. Overall, 86.9% of samples contain explicit type annotations, and only 2.9% are flagged as potentially AI-generated. Baseline experiments show that fine-tuning a large language model on Code2Doc yields relative improvements of 29.47% in BLEU and 24.04% in ROUGE-L over zero-shot performance, despite the modest dataset size. We release both the dataset and the full curation pipeline to support reproducible research on automatic code documentation generation.
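The four-stage curation pipeline can be sketched as a sequence of filters over candidate function-documentation pairs. The sketch below is illustrative only: the stage names, thresholds, and heuristics (minimum documentation length, line-count bounds, hash-based deduplication, a phrase-based AI flag) are assumptions for exposition, not the paper's exact criteria.

```python
import hashlib
import re

# Illustrative stand-ins for the four curation stages; thresholds are
# assumptions, not the values used to build Code2Doc.
MIN_DOC_WORDS = 5                  # stage 1: documentation completeness
MIN_CODE_LINES, MAX_CODE_LINES = 3, 200   # stage 2: structural bounds
AI_MARKERS = ("as an ai", "this function likely")  # stage 4: toy AI-text flag


def stage1_doc_quality(doc: str) -> bool:
    """Keep pairs whose documentation is non-trivial."""
    return len(doc.split()) >= MIN_DOC_WORDS


def stage2_structure(code: str) -> bool:
    """Keep functions within a reasonable size range."""
    n = len(code.strip().splitlines())
    return MIN_CODE_LINES <= n <= MAX_CODE_LINES


def stage3_dedupe(code: str, seen: set) -> bool:
    """Drop exact duplicates via a whitespace-normalized hash
    (near-duplicate detection is omitted in this sketch)."""
    norm = re.sub(r"\s+", " ", code.strip().lower())
    digest = hashlib.sha256(norm.encode()).hexdigest()
    if digest in seen:
        return False
    seen.add(digest)
    return True


def stage4_ai_flag(doc: str) -> bool:
    """Reject documentation containing tell-tale AI phrasing (toy heuristic)."""
    lower = doc.lower()
    return not any(marker in lower for marker in AI_MARKERS)


def curate(pairs):
    """Run candidate (code, doc) pairs through all four stages in order."""
    seen: set = set()
    return [
        (code, doc)
        for code, doc in pairs
        if stage1_doc_quality(doc)
        and stage2_structure(code)
        and stage3_dedupe(code, seen)
        and stage4_ai_flag(doc)
    ]
```

In this framing, a candidate survives only if it passes every stage, which mirrors how a 52,069-pair candidate pool can shrink to the 25.6% that satisfy all quality constraints.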