Understanding correct API usage sequences is one of the most important tasks for programmers working with unfamiliar libraries. However, programmers often struggle to find the appropriate information, due either to poor-quality API documentation or to ineffective query-based search strategies. To address this issue, researchers have proposed various methods that suggest API sequences given natural language queries expressing programmers' information needs. Among these efforts, Gu et al. adopted a deep learning method, specifically an RNN Encoder-Decoder architecture, for this task and obtained promising results on common Java APIs. In this work, we aim to reproduce their results and apply the same method to Python APIs. Additionally, we compare its performance with a more recent Transformer-based method, CodeBERT, on the same task. Our experiments reveal a clear drop in performance measures when careful data cleaning is performed. Owing to its pretraining on a large corpus of source code and its effective encoding technique, CodeBERT outperforms the method of Gu et al. by a large margin.
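For concreteness, the sketch below illustrates the general shape of an RNN Encoder-Decoder that maps a natural language query to an API sequence. It is our own minimal illustration, not the implementation of Gu et al.; all vocabulary sizes, dimensions, and token ids are hypothetical assumptions.

```python
# Minimal seq2seq sketch: encode a query, decode an API sequence.
# Hyperparameters (vocab sizes, emb, hidden) are illustrative only.
import torch
import torch.nn as nn

class Seq2SeqAPI(nn.Module):
    def __init__(self, query_vocab=5000, api_vocab=2000, emb=128, hidden=256):
        super().__init__()
        self.src_emb = nn.Embedding(query_vocab, emb)
        self.tgt_emb = nn.Embedding(api_vocab, emb)
        self.encoder = nn.GRU(emb, hidden, batch_first=True)
        self.decoder = nn.GRU(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, api_vocab)

    def forward(self, query_ids, api_ids):
        # Encode the natural language query into a context vector.
        _, ctx = self.encoder(self.src_emb(query_ids))
        # Decode the API sequence conditioned on that context
        # (teacher forcing: gold API tokens are fed as decoder input).
        dec_out, _ = self.decoder(self.tgt_emb(api_ids), ctx)
        return self.out(dec_out)  # logits over the API vocabulary

# Toy usage: a batch of 2 queries (length 6) and API sequences (length 4).
model = Seq2SeqAPI()
query = torch.randint(0, 5000, (2, 6))
apis = torch.randint(0, 2000, (2, 4))
logits = model(query, apis)
print(logits.shape)  # torch.Size([2, 4, 2000])
```

At inference time, such a model would instead decode token by token (e.g., with beam search) starting from a start-of-sequence symbol, which is how an API sequence is suggested for an unseen query.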