Representation learning of source code is essential for applying machine learning to software engineering tasks. Learning code representations across different programming languages has been shown to be more effective than learning from single-language datasets, since the larger amount of training data in multi-language datasets improves the model's ability to extract language-agnostic information from source code. However, when training on multi-language datasets, existing multi-language models focus only on learning parameters shared among the different languages and overlook the language-specific information that is crucial for downstream tasks. To address this problem, we propose MetaTPTrans, a meta learning approach for multilingual code representation learning. MetaTPTrans generates different parameters for the feature extractor according to the specific programming language of the input source code snippet, enabling the model to learn both language-agnostic and language-specific information. Experimental results show that MetaTPTrans significantly improves the F1 score of state-of-the-art approaches by up to 2.40 percentage points on code summarization, a language-agnostic task, and the Top-1 (Top-5) prediction accuracy by up to 7.32 (13.15) percentage points on code completion, a language-specific task.
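The core idea of language-conditioned parameter generation can be illustrated with a minimal sketch. This is a hypothetical toy implementation, not the authors' code: a shared generator (a hypernetwork-style meta learner) maps a learnable per-language embedding to the weights of a small linear feature extractor, so the extractor's parameters differ per language while the generator itself is shared across languages.

```python
# Toy sketch (hypothetical, not the MetaTPTrans implementation) of
# language-conditioned parameter generation for a feature extractor.
import random

random.seed(0)

LANGS = ["python", "java", "go"]
EMB_DIM, IN_DIM, OUT_DIM = 4, 3, 2

# Shared learnable pieces: one embedding per language and one generator matrix.
lang_emb = {l: [random.uniform(-1, 1) for _ in range(EMB_DIM)] for l in LANGS}
# The generator maps an EMB_DIM embedding to IN_DIM * OUT_DIM extractor weights.
generator = [[random.uniform(-1, 1) for _ in range(EMB_DIM)]
             for _ in range(IN_DIM * OUT_DIM)]

def generate_weights(lang):
    """Produce extractor weights conditioned on the input language."""
    e = lang_emb[lang]
    flat = [sum(g * x for g, x in zip(row, e)) for row in generator]
    # Reshape the flat vector into an OUT_DIM x IN_DIM weight matrix.
    return [flat[i * IN_DIM:(i + 1) * IN_DIM] for i in range(OUT_DIM)]

def extract_features(x, lang):
    """Apply the language-specific linear extractor to input features x."""
    w = generate_weights(lang)
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in w]

x = [0.5, -1.0, 2.0]           # a stand-in for encoded source-code features
feats_py = extract_features(x, "python")
feats_java = extract_features(x, "java")
```

Because the same input passes through weights generated from different language embeddings, `feats_py` and `feats_java` differ, capturing language-specific information, while gradients flow into the single shared generator, capturing language-agnostic information.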