Despite the recent trend of developing and applying neural source code models to software engineering tasks, the quality of such models remains insufficient for real-world use. One likely cause is noise in the source code corpora used to train them. In this paper, we adapt data-influence methods to detect such noise. Data-influence methods are used in machine learning to evaluate how similar a target sample is to known-correct samples, and thereby to determine whether the target sample is noisy. Our evaluation results show that data-influence methods can identify noisy samples for neural code models in classification-based tasks. This approach contributes to the larger vision of building better neural source code models from a data-centric perspective, which is a key driver for developing source code models that are useful in practice.
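The idea of judging a target sample by its similarity to known-correct samples can be illustrated with a minimal sketch. This is not the method evaluated in the paper: it swaps in a simple nearest-neighbour label-disagreement score over feature vectors, and the names `cosine`, `noise_score`, the parameter `k`, and the toy vectors are all hypothetical, chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def noise_score(target_vec, target_label, clean_vecs, clean_labels, k=3):
    """Fraction of the k most similar clean samples whose label differs from the target's.

    A score near 1.0 suggests the target's label may be noisy; near 0.0 suggests
    it agrees with its neighbours. This is an illustrative heuristic only.
    """
    if not clean_vecs or k < 1:
        raise ValueError("need at least one clean sample and k >= 1")
    # Rank clean samples by similarity to the target, most similar first.
    ranked = sorted(
        zip(clean_vecs, clean_labels),
        key=lambda pair: cosine(target_vec, pair[0]),
        reverse=True,
    )
    nearest = ranked[:k]
    disagreements = sum(1 for _, label in nearest if label != target_label)
    return disagreements / len(nearest)


# Toy feature vectors (assumed, e.g. embeddings of code snippets) with class labels.
clean_vecs = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.0, 1.0], [0.1, 0.9], [0.2, 0.8]]
clean_labels = [0, 0, 0, 1, 1, 1]

suspicious = noise_score([1.0, 0.05], 1, clean_vecs, clean_labels)  # label conflicts with neighbours
consistent = noise_score([1.0, 0.05], 0, clean_vecs, clean_labels)  # label matches neighbours
```

In this toy setting the first call scores higher than the second, because the target sits among class-0 samples but carries label 1. A real pipeline would need a trusted clean set and a learned representation, neither of which is specified here.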