Demystifying What Code Summarization Models Learned

Studying the patterns that models have learned has long been a focus of pattern recognition research. Explaining what patterns are discovered from training data, and how those patterns generalize to unseen data, is instrumental to understanding and advancing pattern recognition methods. Unfortunately, the vast majority of application domains deal with continuous data (i.e., statistical in nature), from which the extracted patterns cannot be formally defined. For example, in image classification, there is no principled definition for the label cat or dog. Even in natural language, the meaning of a word can vary with its surrounding context. Unlike these data formats, programs are a unique data structure with well-defined syntax and semantics, which creates a golden opportunity to formalize what models have learned from source code. This paper presents the first formal definition of the patterns discovered by code summarization models (i.e., models that predict the name of a method given its body) and gives a sound algorithm to infer a context-free grammar (CFG) that formally describes the learned patterns. We realize our approach in PATIC, which produces CFGs summarizing the patterns discovered by code summarization models. In particular, we pick two prominent instances, code2vec and code2seq, to evaluate PATIC. PATIC shows that the patterns extracted by each model are heavily restricted to local, syntactic code structures with little to no semantic implication. Based on these findings, we present two example uses of the formal definition of patterns: a new method for evaluating the robustness of code summarization models and a new technique for improving their accuracy. Our work opens up this exciting new direction of studying what models have learned from source code.

