Semantic understanding of programs is a fundamental problem in programming language processing (PLP). Recent work that learns representations of code using pre-training techniques from NLP has pushed the frontier in this direction. However, the semantics of programming languages (PL) and natural languages (NL) differ in essential ways. If these differences are ignored, we believe it is difficult to build a model that understands programs well, whether by directly applying off-the-shelf NLP pre-training techniques to source code or by adding heuristically chosen features to the model. In fact, the semantics of a program can be rigorously defined by formal semantics in PL theory. For example, operational semantics describes the meaning of a valid program as a sequence of updates to the environment (i.e., the memory address-value function) through fundamental operations such as memory I/O and conditional branching. Inspired by this, we propose a novel program semantics learning paradigm in which the model learns from information composed of (1) representations that align well with the fundamental operations in operational semantics, and (2) information about environment transitions, which is indispensable for program understanding. To validate our proposal, we present OSCAR, a hierarchical Transformer-based pre-training model designed to facilitate the understanding of programs. OSCAR learns from intermediate representation (IR) and from an encoded representation derived from static analysis, which are used to represent the fundamental operations and to approximate the environment transitions, respectively. Empirically, OSCAR shows an outstanding capability for program semantics understanding on many practical software engineering tasks.
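To make the operational-semantics view concrete, the following minimal sketch treats a program as a sequence of environment updates. It is an illustration only, not OSCAR's implementation or input format: the four-instruction set (`const`, `add`, `br_if_zero`, `jump`), the `step` and `run` functions, and the use of variable names as "addresses" are all assumptions introduced here.

```python
from typing import Dict, List, Tuple

# The environment maps an address (here simply a variable name) to a value.
Env = Dict[str, int]
Instr = Tuple  # hypothetical toy instructions, loosely IR-like

def step(pc: int, env: Env, program: List[Instr]) -> Tuple[int, Env]:
    """Apply one fundamental operation; return the next pc and the new environment."""
    op = program[pc]
    kind = op[0]
    new_env = dict(env)  # copy so that each transition yields a fresh environment
    if kind == "const":            # memory write: dst := value
        _, dst, value = op
        new_env[dst] = value
        return pc + 1, new_env
    if kind == "add":              # memory read + write: dst := env[a] + env[b]
        _, dst, a, b = op
        new_env[dst] = env[a] + env[b]
        return pc + 1, new_env
    if kind == "br_if_zero":       # conditional branch: environment is unchanged
        _, cond, target = op
        return (target if env[cond] == 0 else pc + 1), new_env
    if kind == "jump":             # unconditional branch
        _, target = op
        return target, new_env
    raise ValueError(f"unknown instruction: {kind}")

def run(program: List[Instr], max_steps: int = 1000) -> Tuple[Env, List[Env]]:
    """Execute until pc leaves the program; record every environment transition."""
    pc, env = 0, {}
    trace = [env]
    for _ in range(max_steps):
        if not 0 <= pc < len(program):
            break
        pc, env = step(pc, env, program)
        trace.append(env)
    return env, trace

# Sum 3 + 2 + 1 with a countdown loop.
program = [
    ("const", "n", 3),          # 0
    ("const", "acc", 0),        # 1
    ("const", "neg1", -1),      # 2
    ("br_if_zero", "n", 7),     # 3: exit the loop when n == 0
    ("add", "acc", "acc", "n"), # 4
    ("add", "n", "n", "neg1"),  # 5
    ("jump", 3),                # 6
]

final_env, trace = run(program)
print(final_env)  # {'n': 0, 'acc': 6, 'neg1': -1}
```

Here the list of instructions plays the role of the fundamental operations, and `trace` plays the role of the environment transitions. The sketch computes the transitions exactly by executing the program, whereas the abstract describes approximating them through static analysis.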