Representation learning of source code is essential for applying machine learning to software engineering tasks. Learning code representations from a multilingual source code dataset has been shown to be more effective than learning from single-language datasets separately, since the additional training data in a multilingual dataset improves the model's ability to extract language-agnostic information from source code. However, existing multilingual training focuses only on learning a unified model with parameters shared across languages to capture language-agnostic information, and overlooks the language-specific information that is also crucial for modeling source code in different programming languages. To address this problem, we propose MetaTPTrans, a meta-learning approach for multilingual code representation learning. MetaTPTrans generates different parameters for the feature extractor according to the programming language of the input code snippet, enabling the model to learn both language-agnostic and language-specific information through dynamic parameters in the feature extractor. We conduct experiments on code summarization and code completion tasks to verify the effectiveness of our approach. The results demonstrate the superiority of our approach, with significant improvements over state-of-the-art baselines.
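To make the parameter-generation idea concrete, below is a minimal PyTorch sketch of a language-conditioned (hypernetwork-style) projection: a learnable language embedding is fed to a small generator network that produces the weights of a linear map applied to the code features. The class name LanguageAwareProjection, the dimensions, and the single-layer weight generator are illustrative assumptions for exposition, not the paper's actual architecture.

```python
# Hedged sketch: language-conditioned parameter generation for a feature extractor.
# All names and hyperparameters here are illustrative assumptions.
import torch
import torch.nn as nn


class LanguageAwareProjection(nn.Module):
    """Generates the weights of a linear projection from a language embedding."""

    def __init__(self, num_languages: int, d_model: int, d_lang: int = 64):
        super().__init__()
        self.d_model = d_model
        # One learnable embedding per programming language (e.g., Python, Java, ...).
        self.lang_embedding = nn.Embedding(num_languages, d_lang)
        # Meta learner: maps the language embedding to a full weight matrix.
        self.weight_generator = nn.Linear(d_lang, d_model * d_model)

    def forward(self, x: torch.Tensor, lang_id: torch.Tensor) -> torch.Tensor:
        # x:       (batch, seq_len, d_model) token features of the code snippets
        # lang_id: (batch,) integer id of each snippet's programming language
        lang_vec = self.lang_embedding(lang_id)               # (batch, d_lang)
        weight = self.weight_generator(lang_vec)              # (batch, d_model * d_model)
        weight = weight.view(-1, self.d_model, self.d_model)  # (batch, d_model, d_model)
        # Apply a per-sample, language-specific linear map to the features.
        return torch.bmm(x, weight)


# Usage: project features of two snippets from different languages (ids 0 and 1).
proj = LanguageAwareProjection(num_languages=4, d_model=16)
features = torch.randn(2, 10, 16)
out = proj(features, torch.tensor([0, 1]))
print(out.shape)  # torch.Size([2, 10, 16])
```

In this sketch the shared parts of the model (the embedding and the generator) capture language-agnostic information, while the generated weights specialize the projection to each input language, which is the spirit of the dynamic-parameter feature extractor described above.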