Code completion aims to improve developers' productivity by suggesting the next code tokens given a context. Various approaches incorporate abstract syntax tree (AST) information into model training so that code completion is aware of the syntax of the programming language. However, existing syntax-aware code completion approaches are not on-the-fly: we found that for two-thirds of the characters that developers type, the AST cannot be extracted because extraction requires syntactically correct source code, which limits the practicality of these approaches in real-world scenarios. Existing on-the-fly code completion approaches, on the other hand, do not yet consider syntactic information. In this paper, we propose PyCoder, which leverages token types, a lightweight kind of syntactic information that is readily available and aligns with the natural order of source code. PyCoder is trained in a multi-task manner: by learning the supporting task of predicting token types during training, the model achieves better performance on predicting tokens and lines of code without needing token types at inference time. Comprehensive experiments show that PyCoder ranks first on the CodeXGLUE leaderboard with an accuracy of 77.12% for token-level predictions, which is 0.43%-24.25% more accurate than the baselines. In addition, PyCoder achieves an exact match of 43.37% for line-level predictions, which is 3.63%-84.73% more accurate than the baselines. These results lead us to conclude that token type information (an alternative to syntactic information), which has rarely been used in the past, can greatly improve the performance of code completion approaches without requiring syntactically correct source code as AST-based approaches do. PyCoder is publicly available on HuggingFace.
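To make the contrast between AST extraction and token types concrete, the following is a minimal sketch using only Python's standard library. It is not PyCoder's actual preprocessing pipeline: the helper names `can_parse_ast` and `token_types`, the `KEYWORD` label, and the choice to drop layout tokens are all assumptions made for illustration. It shows that a half-typed snippet can be rejected by the parser while the tokenizer still yields a type for each token in source order.

```python
import ast
import io
import keyword
import tokenize

# Layout-only tokens that carry no code text; dropping them is an
# illustrative choice, not something taken from the paper.
LAYOUT = {
    tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
    tokenize.DEDENT, tokenize.ENDMARKER,
}


def can_parse_ast(source: str) -> bool:
    """Return True if the source is syntactically complete enough for ast.parse."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


def token_types(source: str) -> list[tuple[str, str]]:
    """Return (token text, token type) pairs in source order.

    Tokens produced before a tokenizer error are kept, so partially
    typed code still yields a usable prefix.
    """
    pairs = []
    readline = io.StringIO(source).readline
    try:
        for tok in tokenize.generate_tokens(readline):
            if tok.type in LAYOUT:
                continue
            if tok.type == tokenize.NAME and keyword.iskeyword(tok.string):
                label = "KEYWORD"  # hypothetical finer-grained label
            else:
                label = tokenize.tok_name[tok.type]
            pairs.append((tok.string, label))
    except (tokenize.TokenError, SyntaxError):
        pass  # keep whatever was tokenized before the error
    return pairs


# A snippet as it might look while the developer is still typing.
PARTIAL = "def add(a, b):\n    return a +\n"

if __name__ == "__main__":
    print("AST available:", can_parse_ast(PARTIAL))
    for text, label in token_types(PARTIAL):
        print(f"{label:8} {text}")
```

In a multi-task setup such as the one the abstract describes, pairs like these could serve as labels for the supporting token-type task during training only; how PyCoder actually encodes them is not specified here.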