The performance of neural code search is strongly influenced by the quality of the training data from which the neural models are derived. A large corpus of high-quality query-code pairs is required to learn a precise mapping from natural language to programming language. Due to the limited availability of such data, most widely-used code search datasets are built with compromises, such as using code comments as substitutes for queries. Our empirical study on a popular code search dataset reveals that over one-third of its queries contain noise that makes them deviate from natural user queries. Models trained on noisy data suffer severe performance degradation when applied in real-world scenarios. Improving dataset quality so that the queries of its samples are semantically equivalent to real user queries is therefore critical for the practical usability of neural code search. In this paper, we propose a data cleaning framework consisting of two subsequent filters: a rule-based syntactic filter and a model-based semantic filter. This is the first framework that applies semantic query cleaning to code search datasets. We evaluate the effectiveness of our framework on two widely-used code search models and three manually-annotated code retrieval benchmarks. Training the popular DeepCS model on the dataset filtered by our framework improves its performance by 19.2% in MRR and 21.3% in Answer@1 on average across the three validation benchmarks.
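To make the two-stage design concrete, the sketch below shows one way such a pipeline could be wired together. It is a minimal illustration only: the specific syntactic rules, the `semantic_score` callable, and the threshold are hypothetical placeholders, not the rules or model proposed in the paper.

```python
import re
from typing import Callable, Iterable, List, Tuple

QueryCodePair = Tuple[str, str]  # (query text, code snippet)

# Stage 1: rule-based syntactic filter.
# Illustrative rules that drop comment-style noise unlikely to appear
# in natural user queries (hypothetical examples, not the paper's rule set).
SYNTACTIC_RULES: List[Callable[[str], bool]] = [
    lambda q: "http://" in q or "https://" in q,               # embedded hyperlinks
    lambda q: bool(re.match(r"^\s*(todo|fixme)\b", q, re.I)),  # issue-tracking tags
    lambda q: len(q.split()) < 2,                              # too short to be a query
]

def passes_syntactic_filter(query: str) -> bool:
    """A query survives stage 1 only if no syntactic rule flags it."""
    return not any(rule(query) for rule in SYNTACTIC_RULES)

# Stage 2: model-based semantic filter.
# `semantic_score` stands in for a learned model that rates how query-like
# a comment is; any scorer returning a value in [0, 1] would fit this slot.
def clean_dataset(
    pairs: Iterable[QueryCodePair],
    semantic_score: Callable[[str], float],
    threshold: float = 0.5,
) -> List[QueryCodePair]:
    kept = []
    for query, code in pairs:
        if not passes_syntactic_filter(query):
            continue
        if semantic_score(query) < threshold:
            continue
        kept.append((query, code))
    return kept
```

Running the two filters in sequence keeps the cheap rule-based checks up front, so the model-based scorer only sees candidates that already look syntactically plausible.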
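The reported gains are measured with MRR and Answer@1, which can be computed as in the following sketch. The function names and the example rank list are illustrative assumptions, not the paper's evaluation code.

```python
from typing import List, Optional

def mean_reciprocal_rank(ranks: List[Optional[int]]) -> float:
    """MRR over queries; `ranks` holds the 1-based rank of the first
    relevant code snippet for each query, or None if it was not retrieved."""
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)

def answer_at_k(ranks: List[Optional[int]], k: int = 1) -> float:
    """Answer@k: fraction of queries whose first relevant result
    appears within the top-k retrieved snippets."""
    return sum(1 for r in ranks if r is not None and r <= k) / len(ranks)

# Hypothetical ranks of the first correct answer for five queries.
ranks = [1, 3, None, 2, 1]
print(mean_reciprocal_rank(ranks))  # (1 + 1/3 + 0 + 1/2 + 1) / 5 = 0.5667
print(answer_at_k(ranks, k=1))      # 2/5 = 0.4
```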