AI modeling for source code understanding tasks has been making significant progress and is being adopted in production development pipelines. However, reliability concerns are being raised, especially regarding whether the models are actually learning task-relevant aspects of source code. While recent model-probing approaches have observed a lack of signal awareness in many AI-for-code models, i.e., models not capturing task-relevant signals, they do not offer solutions to rectify this problem. In this paper, we explore data-driven approaches to enhance models' signal awareness: 1) we combine the SE concept of code complexity with the AI technique of curriculum learning; 2) we incorporate SE assistance into AI models by customizing Delta Debugging to generate simplified, signal-preserving programs and augmenting the training dataset with them. With our techniques, we achieve up to a 4.8x improvement in model signal awareness. Using the notion of code complexity, we further present a novel, dataset-perspective approach to introspecting model learning.
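For illustration, below is a minimal Python sketch of the complexity-ordered curriculum idea described above; it is not the paper's implementation. The `complexity_proxy` heuristic, the staged `train_with_curriculum` loop, and the `model.train_step` hook are all hypothetical placeholders standing in for whatever complexity metric and training routine a real pipeline would use.

```python
# Sketch (under assumptions): order training samples by a rough code-complexity
# proxy and expose the model to progressively harder subsets, as in curriculum
# learning keyed on SE code-complexity measures.

import re
from typing import List, Tuple

BRANCH_KEYWORDS = re.compile(r"\b(if|for|while|case|catch)\b")

def complexity_proxy(code: str) -> int:
    """Crude stand-in for cyclomatic complexity: 1 + number of branch keywords."""
    return 1 + len(BRANCH_KEYWORDS.findall(code))

def curriculum_order(samples: List[Tuple[str, int]]) -> List[Tuple[str, int]]:
    """Order (code, label) pairs from simplest to most complex."""
    return sorted(samples, key=lambda s: complexity_proxy(s[0]))

def train_with_curriculum(model, samples, epochs: int = 3, stages: int = 3):
    """Grow the visible slice of the complexity-sorted data each epoch."""
    ordered = curriculum_order(samples)
    for epoch in range(epochs):
        cutoff = int(len(ordered) * min(1.0, (epoch + 1) / stages))
        for code, label in ordered[:cutoff]:
            model.train_step(code, label)  # hypothetical training hook
```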