Neural program embeddings have demonstrated considerable promise in a range of program analysis tasks, including clone detection, program repair, code completion, and program synthesis. However, most existing methods generate neural program embeddings directly from program source code, learning from features such as tokens, abstract syntax trees, and control flow graphs. This paper takes a fresh look at how to improve program embeddings by leveraging compiler intermediate representation (IR). We first demonstrate simple yet highly effective methods that enhance embedding quality by training embedding models on both source code and the LLVM IR produced by default optimization levels (e.g., -O2). We then introduce IRGen, a framework based on genetic algorithms (GA), which identifies (near-)optimal sequences of optimization flags that further and significantly improve embedding quality.
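To make the GA-based flag search concrete, the following is a minimal sketch of the idea: evolve fixed-length sequences of LLVM optimization flags under a fitness function. The flag pool, sequence length, GA hyperparameters, and especially the `fitness` function are all illustrative assumptions; in IRGen, fitness would be the measured embedding quality after compiling the corpus with the candidate flag sequence, not the toy score used here.

```python
import random

# Hypothetical pool of LLVM optimization passes to search over; the real
# search space in IRGen is an assumption on our part.
FLAG_POOL = ["-mem2reg", "-loop-unroll", "-inline", "-gvn", "-sccp",
             "-licm", "-instcombine", "-simplifycfg"]
SEQ_LEN = 6


def fitness(seq):
    # Toy stand-in for "embedding quality after compiling with this flag
    # sequence": rewards diverse flags and adjacent variety. The actual
    # objective would invoke the compiler and evaluate the trained model.
    return len(set(seq)) + 0.1 * sum(a != b for a, b in zip(seq, seq[1:]))


def crossover(a, b):
    # Single-point crossover of two flag sequences.
    cut = random.randrange(1, SEQ_LEN)
    return a[:cut] + b[cut:]


def mutate(seq, rate=0.2):
    # Replace each flag with a random one at the given mutation rate.
    return [random.choice(FLAG_POOL) if random.random() < rate else f
            for f in seq]


def ga_search(pop_size=20, generations=30, seed=0):
    random.seed(seed)
    pop = [[random.choice(FLAG_POOL) for _ in range(SEQ_LEN)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        elite = pop[: pop_size // 2]  # truncation selection
        children = [mutate(crossover(random.choice(elite),
                                     random.choice(elite)))
                    for _ in range(pop_size - len(elite))]
        pop = elite + children
    return max(pop, key=fitness)


best = ga_search()
print(best)
```

The elite-plus-offspring loop above is the standard generational GA scheme; the expensive part in practice is the fitness evaluation, since each candidate sequence requires recompiling and re-embedding the corpus.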