Studying the patterns that models have learned has long been a focus of pattern recognition research. Explaining what patterns are discovered from training data, and how those patterns generalize to unseen data, is instrumental to understanding and advancing pattern recognition methods. Unfortunately, the vast majority of application domains deal with continuous data (i.e., data that is statistical in nature), from which the extracted patterns cannot be formally defined. For example, in image classification, there is no principled definition of the label cat or dog. Even in natural language, the meaning of a word can vary with its surrounding context. Unlike these data formats, programs are a unique data structure with well-defined syntax and semantics, which creates a golden opportunity to formalize what models have learned from source code. This paper presents the first formal definition of the patterns discovered by code summarization models (i.e., models that predict the name of a method given its body), and gives a sound algorithm to infer a context-free grammar (CFG) that formally describes the learned patterns. We realize our approach in PATIC, which produces CFGs that summarize the patterns discovered by code summarization models. In particular, we pick two prominent instances, code2vec and code2seq, to evaluate PATIC. PATIC shows that the patterns extracted by each model are heavily restricted to local, syntactic code structures with little to no semantic implication. Based on these findings, we present two example uses of the formal definition of patterns: a new method for evaluating the robustness of code summarization models and a new technique for improving their accuracy. Our work opens up an exciting new direction of studying what models have learned from source code.