CodeS: A Distribution Shift Benchmark Dataset for Source Code Learning

Over the past few years, deep learning (DL) has steadily expanded its range of applications and become a driving force for large-scale source code analysis in the big code era. Distribution shift, where the test set follows a different distribution from the training set, has been a longstanding challenge for the reliable deployment of DL models because it causes unexpected accuracy degradation. Although distribution shift benchmarking has recently progressed in domains such as computer vision and natural language processing, limited progress has been made on distribution shift analysis and benchmarking for source code tasks, for which there is strong demand given both the volume of source code and its role in supporting the foundations of almost all industrial sectors. To fill this gap, this paper proposes CodeS, a distribution shift benchmark dataset for source code learning. Specifically, CodeS supports 2 programming languages (Java and Python) and 5 types of code distribution shift (task, programmer, time-stamp, token, and concrete syntax tree (CST)). To the best of our knowledge, we are the first to define code representation-based distribution shifts. In the experiments, we first evaluate the effectiveness of existing out-of-distribution (OOD) detectors and the reasonableness of the distribution shift definitions, and then measure the generalization of popular code learning models (e.g., CodeBERT) on classification tasks. The results demonstrate that 1) only softmax score-based OOD detectors perform well on CodeS, 2) distribution shift causes accuracy degradation in all code classification models, 3) representation-based distribution shifts have a higher impact on the models than the other shift types, and 4) pre-trained models are more resistant to distribution shifts. We make CodeS publicly available, enabling follow-up research on the quality assessment of code learning models.
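To make the idea of a softmax score-based OOD detector concrete, the following is a minimal sketch, assuming the score is the maximum softmax probability of a classifier's output. The abstract does not specify which detectors were evaluated, so the function names, the example logits, and the 0.5 threshold are all hypothetical and are not taken from CodeS.

```python
import math


def softmax(logits):
    """Convert raw classifier logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def msp_score(logits):
    """Maximum softmax probability: higher means the input looks more in-distribution."""
    return max(softmax(logits))


def is_ood(logits, threshold=0.5):
    """Flag an input as out-of-distribution when the model's top probability is low.

    The threshold is an illustrative assumption; in practice it would be tuned
    on held-out in-distribution data.
    """
    return msp_score(logits) < threshold


# Hypothetical logits from a 3-class code classifier.
confident = [6.0, 0.5, 0.2]  # one class clearly dominates
uncertain = [1.0, 0.9, 1.1]  # probabilities are nearly uniform

print(is_ood(confident), is_ood(uncertain))
```

Under these assumptions, a sample drawn from a shifted distribution (for example, code written by an unseen programmer) is expected to yield flatter probabilities and therefore a lower score than a sample resembling the training data.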
