Recent advances in self-supervised learning have dramatically improved the state of the art on a wide variety of tasks. However, research in language model pre-training has mostly focused on natural languages, and it is unclear whether models like BERT and its variants provide the best pre-training when applied to other modalities, such as source code. In this paper, we introduce a new pre-training objective, DOBF, that leverages the structural aspect of programming languages and pre-trains a model to recover the original version of obfuscated source code. We show that models pre-trained with DOBF significantly outperform existing approaches on multiple downstream tasks, providing relative improvements of up to 13% in unsupervised code translation, and 24% in natural language code search. Incidentally, we found that our pre-trained model is able to de-obfuscate fully obfuscated source files, and to suggest descriptive variable names.
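To make the objective concrete, below is a minimal sketch of the obfuscation step on a Python snippet, not the implementation used in the paper: user-defined function and variable names are replaced by uninformative placeholders such as FUNC_0 and VAR_0, and the model is then trained to recover the original names from the obfuscated code. The Obfuscator class and the FUNC_/VAR_ naming scheme are illustrative assumptions.

import ast  # requires Python 3.9+ for ast.unparse

class Obfuscator(ast.NodeTransformer):
    """Toy obfuscator: replaces user-defined identifiers with placeholders."""

    def __init__(self):
        self.mapping = {}  # original name -> placeholder

    def _rename(self, name, prefix):
        if name not in self.mapping:
            index = sum(v.startswith(prefix) for v in self.mapping.values())
            self.mapping[name] = f"{prefix}_{index}"
        return self.mapping[name]

    def visit_FunctionDef(self, node):
        node.name = self._rename(node.name, "FUNC")
        self.generic_visit(node)  # also rename arguments and body identifiers
        return node

    def visit_arg(self, node):
        node.arg = self._rename(node.arg, "VAR")
        return node

    def visit_Name(self, node):
        if node.id in self.mapping:
            node.id = self.mapping[node.id]
        elif isinstance(node.ctx, ast.Store):  # newly assigned local variable
            node.id = self._rename(node.id, "VAR")
        return node  # builtins like range are left untouched

source = """
def factorial(n):
    result = 1
    for i in range(2, n + 1):
        result *= i
    return result
"""

obfuscator = Obfuscator()
obfuscated_tree = obfuscator.visit(ast.parse(source))
print(ast.unparse(obfuscated_tree))  # model input: obfuscated code
print(obfuscator.mapping)            # training target: the original names

In this sketch, the obfuscated function (with factorial, n, result, and i replaced by FUNC_0, VAR_0, VAR_1, VAR_2) serves as the model input, and the mapping back to the original identifiers serves as the prediction target, which is what makes it possible for a pre-trained model to suggest descriptive names for obfuscated or poorly named code.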