In the field of source code processing, transformer-based representation models have proven highly effective and have achieved state-of-the-art (SOTA) performance on many tasks. Although transformer models process source code as a token sequence, there is evidence that they may also capture structural information (\eg, from the syntax tree, data flow, control flow, \etc). We propose the aggregated attention score, a method for investigating the structural information learned by the transformer. We also put forward the aggregated attention graph, a new way to automatically extract program graphs from pre-trained models. We evaluate our methods from multiple perspectives. Furthermore, based on our empirical findings, we use the automatically extracted graphs to replace the carefully hand-designed graphs in the Variable Misuse task. Experimental results show that the semantic graphs we extract automatically are highly meaningful and effective, providing a new perspective for understanding and using the information contained in the model.
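To make the underlying idea concrete, the following is a minimal sketch (not the paper's exact procedure) of how attention weights from a pre-trained code model could be aggregated into a token-level graph; the checkpoint microsoft/codebert-base, the average over all layers and heads, and the 0.05 edge threshold are all illustrative assumptions.

\begin{verbatim}
import torch
from transformers import AutoTokenizer, AutoModel

# Load a pre-trained code model; CodeBERT is one plausible choice.
tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base",
                                  output_attentions=True)
model.eval()

code = "def add(a, b):\n    return a + b"
inputs = tokenizer(code, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions: one (batch, heads, seq, seq) tensor per layer.
# Aggregate by averaging over all layers and heads (one simple scheme).
attn = torch.stack(outputs.attentions)   # (layers, batch, heads, seq, seq)
agg = attn.mean(dim=(0, 2)).squeeze(0)   # (seq, seq)

# Symmetrize and threshold to obtain an undirected token-level graph.
sym = (agg + agg.T) / 2
threshold = 0.05                          # hypothetical cutoff
edges = (sym > threshold).nonzero(as_tuple=False)

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for i, j in edges.tolist():
    if i < j:                             # report each undirected edge once
        print(tokens[i], "--", tokens[j])
\end{verbatim}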