Despite the recent trend of building source code models and applying them to software engineering tasks, the quality of these models remains insufficient for real-world use. In this work, we focus on improving existing code learning models from a data-centric perspective rather than designing new source code models. We shed light on this direction by using a so-called data-influence method to identify noisy samples for pre-trained code learning models. A data-influence method assesses the similarity of a target sample to correct samples to determine whether the target sample is noisy. Our evaluation shows that data-influence methods can identify noisy samples for the code classification and defect prediction tasks. We envision that the data-centric approach will be a key driver for developing source code models that are useful in practice.
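To make the idea of comparing a target sample against correct samples concrete, the following is a minimal sketch and not the method evaluated in this work. It assumes each code sample has already been mapped to a fixed-length embedding vector by some pre-trained model, and it uses mean cosine similarity to a small set of trusted samples sharing the same label as a simple stand-in for a data-influence score. The names `similarity_score`, `flag_noisy`, `trusted_by_label`, and the threshold of 0.5 are all hypothetical choices made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_score(target, trusted):
    """Mean cosine similarity of a target embedding to a non-empty list of trusted embeddings."""
    return sum(cosine(target, t) for t in trusted) / len(trusted)


def flag_noisy(samples, trusted_by_label, threshold=0.5):
    """Return indices of samples that look dissimilar to trusted samples with the same label.

    samples: list of (embedding, label) pairs.
    trusted_by_label: dict mapping each label to a list of trusted embeddings.
    threshold: hypothetical cut-off; a real study would tune it on held-out data.
    """
    flagged = []
    for idx, (embedding, label) in enumerate(samples):
        trusted = trusted_by_label.get(label, [])
        if not trusted:
            continue  # no trusted reference for this label, so no decision is made
        if similarity_score(embedding, trusted) < threshold:
            flagged.append(idx)
    return flagged


# Toy two-dimensional embeddings, invented purely for illustration.
trusted_by_label = {
    "sort": [[1.0, 0.0], [0.9, 0.1]],
    "search": [[0.0, 1.0], [0.1, 0.9]],
}
samples = [
    ([0.95, 0.05], "sort"),    # close to the trusted "sort" samples
    ([0.05, 0.95], "sort"),    # resembles "search" but is labeled "sort"
    ([0.0, 1.0], "search"),    # close to the trusted "search" samples
]
noisy_indices = flag_noisy(samples, trusted_by_label)
```

In this toy setup only the second sample, whose embedding resembles the other class, falls below the threshold. Actual data-influence methods typically estimate how training samples affect model predictions rather than relying on raw embedding similarity, so this sketch should be read only as an illustration of the flagging step.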