Evaluating the Impact of Source Code Parsers on ML4SE Models

As researchers and practitioners apply machine learning to a growing number of software engineering problems, the approaches they use become more sophisticated. Many modern approaches exploit the internal structure of code in the form of an abstract syntax tree (AST) or its extensions: path-based representations and complex graphs that combine the AST with additional edges. Although ASTs can be extracted from code with different parsers, the impact of the choice of parser on final model quality remains unstudied. Moreover, researchers often omit the exact details of how they extract particular code representations. In this work, we evaluate two models, Code2Seq and TreeLSTM, on the method name prediction task, backed by eight different parsers for the Java language. To unify data preparation across parsers, we develop SuperParser, a multi-language, parser-agnostic library based on PathMiner. SuperParser facilitates the end-to-end creation of datasets suitable for training and evaluating ML models that work with structural information from source code. Our results demonstrate that trees built by different parsers vary in their structure and content. We then analyze how this diversity affects the models' quality and show that the quality gap between the most and least suitable parsers is significant for both models. Finally, we discuss other features of the parsers that researchers and practitioners should take into account when selecting a parser, in addition to its impact on the models' quality. The code of SuperParser is publicly available at https://doi.org/10.5281/zenodo.6366591. We also publish Java-norm, the dataset we use to evaluate the models: https://doi.org/10.5281/zenodo.6366599.
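To make the claim about structural variation concrete, the following is a minimal sketch in Python, not taken from SuperParser or from any of the evaluated parsers. It builds two hand-written toy trees for the Java statement `return a + b;` and compares their size, depth, and leaf-to-leaf paths in the style of path-based representations. The node type names (`ReturnStmt`, `BinaryExpr`, `expression`, `primary`, and so on), the `Node` class, and all helper functions are assumptions made for illustration; real parsers emit their own node vocabularies.

```python
from dataclasses import dataclass, field
from itertools import combinations
from typing import List, Optional, Tuple


@dataclass
class Node:
    """A toy AST node: a type label, an optional token, and child nodes."""
    type: str
    token: Optional[str] = None
    children: List["Node"] = field(default_factory=list)


def size(node: Node) -> int:
    """Total number of nodes in the tree."""
    return 1 + sum(size(c) for c in node.children)


def depth(node: Node) -> int:
    """Number of nodes on the longest root-to-leaf path."""
    return 1 + max((depth(c) for c in node.children), default=0)


def leaf_paths(node: Node) -> List[List[Node]]:
    """All root-to-leaf paths, each as a list of nodes."""
    if not node.children:
        return [[node]]
    return [[node] + p for c in node.children for p in leaf_paths(c)]


def path_contexts(root: Node) -> List[Tuple[Optional[str], List[str], Optional[str]]]:
    """Leaf-to-leaf paths as (start token, node types on the path, end token)."""
    contexts = []
    for a, b in combinations(leaf_paths(root), 2):
        # Skip the shared prefix; a[i - 1] is the lowest common ancestor.
        i = 0
        while i < min(len(a), len(b)) and a[i] is b[i]:
            i += 1
        up = [n.type for n in reversed(a[i:])]   # from the first leaf upwards
        down = [n.type for n in b[i:]]           # down to the second leaf
        contexts.append((a[-1].token, up + [a[i - 1].type] + down, b[-1].token))
    return contexts


# Hypothetical compact, AST-style output for `return a + b;`.
tree_a = Node("ReturnStmt", children=[
    Node("BinaryExpr", children=[Node("Name", "a"), Node("Name", "b")]),
])

# Hypothetical verbose, concrete-syntax-style output for the same statement.
tree_b = Node("returnStatement", children=[
    Node("RETURN", "return"),
    Node("expression", children=[
        Node("expression", children=[Node("primary", children=[Node("IDENTIFIER", "a")])]),
        Node("ADD", "+"),
        Node("expression", children=[Node("primary", children=[Node("IDENTIFIER", "b")])]),
    ]),
    Node("SEMI", ";"),
])

for name, tree in [("compact", tree_a), ("verbose", tree_b)]:
    print(name, size(tree), depth(tree), len(path_contexts(tree)))
```

In this toy setting the compact tree has 4 nodes, depth 3, and a single path between `a` and `b` that passes through 3 node types, while the verbose tree has 11 nodes, depth 5, and 10 leaf pairs, with the `a` to `b` path passing through 7 node types. A model that consumes such paths would see noticeably different inputs for the same source code, which is the kind of difference the evaluation measures.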
