Pre-trained code representation models such as CodeBERT have demonstrated superior performance in a variety of software engineering tasks, yet they are computationally heavy: their complexity grows quadratically with the length of the input sequence. Our empirical analysis of CodeBERT's attention reveals that it pays more attention to certain types of tokens and statements, such as keywords and data-relevant statements. Based on these findings, we propose DietCodeBERT, which aims at lightweight leverage of large pre-trained models for source code. DietCodeBERT simplifies the input program of CodeBERT with three strategies, namely word dropout, frequency filtering, and an attention-based strategy that selects the statements and tokens receiving the most attention weights during pre-training. It thereby achieves a substantial reduction in computational cost without hampering model performance. Experimental results on two downstream tasks show that DietCodeBERT provides results comparable to CodeBERT with 40% less computational cost in fine-tuning and testing.
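As a rough illustration of the attention-based selection strategy described above, the following Python sketch (not the authors' implementation; the pruning budget `keep_ratio` and the layer/head averaging scheme are illustrative assumptions) scores each token of an input snippet by the attention it receives from pre-trained CodeBERT and keeps only the highest-scoring tokens.

```python
# Minimal sketch: attention-based token pruning with a pre-trained CodeBERT.
# This is an illustrative approximation, not the DietCodeBERT implementation.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base", output_attentions=True)
model.eval()

def prune_by_attention(code: str, keep_ratio: float = 0.6):
    """Keep the tokens that receive the most attention from the pre-trained model."""
    inputs = tokenizer(code, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # outputs.attentions: tuple of (batch, heads, seq, seq) per layer.
    # Average over layers and heads, then sum over query positions to obtain
    # how much attention each token (key position) receives in total.
    att = torch.stack(outputs.attentions).mean(dim=(0, 2))  # (batch, seq, seq)
    received = att.sum(dim=1).squeeze(0)                     # (seq,)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    k = max(1, int(keep_ratio * len(tokens)))
    keep = sorted(received.topk(k).indices.tolist())         # preserve original order
    return [tokens[i] for i in keep]

print(prune_by_attention("def add(a, b):\n    return a + b"))
```

In practice the pruned token sequence, rather than the full program, would then be fed to the model during fine-tuning and testing, which is where the computational savings come from.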