Recently, many pre-trained language models for source code have been proposed to model the context of code and serve as the basis for downstream code intelligence tasks such as code completion, code search, and code summarization. These models leverage masked pre-training and the Transformer architecture, and have achieved promising results. However, little progress has been made so far on the interpretability of existing pre-trained code models. It is not clear why these models work and what feature correlations they capture. In this paper, we conduct a thorough structural analysis aiming to provide an interpretation of pre-trained language models for source code (e.g., CodeBERT and GraphCodeBERT) from three distinct perspectives: (1) attention analysis, (2) probing of the word embeddings, and (3) syntax tree induction. Through comprehensive analysis, this paper reveals several insightful findings that may inspire future studies: (1) Attention aligns strongly with the syntax structure of code. (2) Pre-trained language models of code preserve the syntax structure of code in the intermediate representations of each Transformer layer. (3) Pre-trained models of code have the ability to induce syntax trees of code. These findings suggest that it may be helpful to incorporate the syntax structure of code into the pre-training process to obtain better code representations.
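As a minimal illustration of the first perspective (attention analysis), the sketch below shows how per-layer, per-head attention weights could be extracted from a pre-trained code model; it assumes the HuggingFace `transformers` library and the public `microsoft/codebert-base` checkpoint, and is not the paper's own implementation. The extracted weight matrices are the raw material that would then be compared against the code's syntax structure (e.g., AST edges).

```python
# A hedged sketch: extracting attention weights from CodeBERT for attention analysis.
# Assumes the HuggingFace `transformers` library; the analysis procedure itself
# (aligning weights with syntax structure) is only indicated in comments.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base", output_attentions=True)
model.eval()

code = "def add(a, b):\n    return a + b"
inputs = tokenizer(code, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions is a tuple with one tensor per Transformer layer,
# each shaped (batch, num_heads, seq_len, seq_len). Each head's matrix
# gives token-to-token attention weights that can be checked for alignment
# with syntactic relations between the corresponding code tokens.
for layer_idx, layer_attn in enumerate(outputs.attentions):
    print(f"layer {layer_idx}: attention shape {tuple(layer_attn.shape)}")
```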