用于跨语言方案分类的统一语法树代表性学习 (Unified Abstract Syntax Tree Representation Learning for Cross-Language Program Classification) - 专知论文

会员服务 ·

0

代码 · 可约的 · 学成 · 表示学习 · 查全率/召回率 ·

2022 年 5 月 1 日

Unified Abstract Syntax Tree Representation Learning for Cross-Language Program Classification

翻译：用于跨语言方案分类的统一语法树代表性学习

Kesu Wang,Meng Yan,He Zhang,Haibo Hu

Program classification can be regarded as a high-level abstraction of code, laying a foundation for various tasks related to source code comprehension, and has a very wide range of applications in the field of software engineering, such as code clone detection, code smell classification, defects classification, etc. The cross-language program classification can realize code transfer in different programming languages, and can also promote cross-language code reuse, thereby helping developers to write code quickly and reduce the development time of code transfer. Most of the existing studies focus on the semantic learning of the code, whilst few studies are devoted to cross-language tasks. The main challenge of cross-language program classification is how to extract semantic features of different programming languages. In order to cope with this difficulty, we propose a Unified Abstract Syntax Tree (namely UAST in this paper) neural network. In detail, the core idea of UAST consists of two unified mechanisms. First, UAST learns an AST representation by unifying the AST traversal sequence and graph-like AST structure for capturing semantic code features. Second, we construct a mechanism called unified vocabulary, which can reduce the feature gap between different programming languages, so it can achieve the role of cross-language program classification. Besides, we collect a dataset containing 20,000 files of five programming languages, which can be used as a benchmark dataset for the cross-language program classification task. We have done experiments on two datasets, and the results show that our proposed approach outperforms the state-of-the-art baselines in terms of four evaluation metrics (Precision, Recall, F1-score, and Accuracy).

翻译：程序分类可以被视为一种高层次的代码抽象,为与源代码理解有关的各种任务奠定基础,并且具有软件工程领域非常广泛的应用,例如代码克隆检测、代码嗅觉分类、缺陷分类等。跨语言程序分类可以实现不同编程语言的代码传输,还可以促进跨语言代码再利用,从而帮助开发者快速编写代码并减少代码传输的开发时间。大多数现有研究侧重于代码的语义学习,而用于跨语言任务的研究则很少。跨语言程序分类的主要挑战是如何提取不同编程语言的语义学特征。为了应对这一困难,我们提议了一个统一的简易语系树(本文中的 UAST ) 神经网络。详细来说, UAST 的核心理念包括两个统一的机制。首先, UAST 通过统一AST Transiversal 序列和像图表一样的 AST 结构来采集语系代码特征。其次,我们构建了一个叫做统一的词汇学的机制,这个机制可以减少我们所使用的不同语言的语系之间的语系。

0

相关内容

代码（Code）是专知网的一个重要知识资料文档板块，旨在整理收录论文源代码、复现代码，经典工程代码等，便于用户查阅下载使用。

计算机科学课程与视频课件合集，Computer Science courses with video lectures

计算机科学课程与视频课件合集，Computer Science courses with video lectures

专知会员服务

37+阅读 · 2022年1月24日

NLP必读经典文献100篇

专知会员服务

124+阅读 · 2020年9月8日

100+篇《自监督学习(Self-Supervised Learning)》论文最新合集

100+篇《自监督学习(Self-Supervised Learning)》论文最新合集

专知会员服务

167+阅读 · 2020年3月18日

图像分类技巧集，17页ppt《Bag of Tricks for Image Classification》

图像分类技巧集，17页ppt《Bag of Tricks for Image Classification》

专知会员服务

96+阅读 · 2020年3月12日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

50+阅读 · 2019年10月17日

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

专知会员服务

59+阅读 · 2019年10月17日

强化学习最新教程，17页pdf

强化学习最新教程，17页pdf

专知会员服务

182+阅读 · 2019年10月11日

[综述]深度学习下的场景文本检测与识别

[综述]深度学习下的场景文本检测与识别

专知会员服务

78+阅读 · 2019年10月10日

机器学习入门的经验与建议

机器学习入门的经验与建议

专知会员服务

94+阅读 · 2019年10月10日

【加州大学伯克利分校博士论文】通过自我监督预测学习泛化

【加州大学伯克利分校博士论文】通过自我监督预测学习泛化

专知会员服务

65+阅读 · 2019年10月9日

VCIP 2022 Call for Demos

VCIP 2022 Call for Demos

CCF多媒体专委会

1+阅读 · 2022年6月6日

VCIP 2022 Call for Special Session Proposals

VCIP 2022 Call for Special Session Proposals

CCF多媒体专委会

1+阅读 · 2022年4月1日

IEEE ICKG 2022: Call for Papers

IEEE ICKG 2022: Call for Papers

机器学习与推荐算法

3+阅读 · 2022年3月30日

ACM MM 2022 Call for Papers

ACM MM 2022 Call for Papers

CCF多媒体专委会

5+阅读 · 2022年3月29日

IEEE TII Call For Papers

IEEE TII Call For Papers

CCF多媒体专委会

3+阅读 · 2022年3月24日

AIART 2022 Call for Papers

AIART 2022 Call for Papers

CCF多媒体专委会

1+阅读 · 2022年2月13日

Hierarchically Structured Meta-learning

Hierarchically Structured Meta-learning

CreateAMind

27+阅读 · 2019年5月22日

Transferring Knowledge across Learning Processes

Transferring Knowledge across Learning Processes

CreateAMind

29+阅读 · 2019年5月18日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

43+阅读 · 2019年1月3日

disentangled-representation-papers

disentangled-representation-papers

CreateAMind

26+阅读 · 2018年9月12日

CPU-GPU耦合架构下数据库连接技术研究

国家自然科学基金

0+阅读 · 2013年12月31日

面向工程设计的有限元软件系统架构研究

国家自然科学基金

0+阅读 · 2012年12月31日

航天器大规模集群飞行的分布式协调控制方法研究

国家自然科学基金

1+阅读 · 2012年12月31日

实时安全关键系统的建模、仿真与验证

国家自然科学基金

1+阅读 · 2012年12月31日

HER2/uPAR通路调控乳腺肿瘤休眠和细胞周期的分子机制

国家自然科学基金

0+阅读 · 2012年12月31日

基于Linked Open Data的Web服务语义互操作关键技术

国家自然科学基金

0+阅读 · 2012年12月31日

面向高性能混合计算的计算型存储器体系结构研究

国家自然科学基金

0+阅读 · 2009年12月31日

TR3相互作用新蛋白机理研究

国家自然科学基金

1+阅读 · 2008年12月31日

组合Web服务的建模与验证

国家自然科学基金

1+阅读 · 2008年12月31日

Lewis y抗原介导的PI3K/Akt2信号转导通路致卵巢癌多药耐药的分子机制

国家自然科学基金

0+阅读 · 2008年12月31日

Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks

Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks

Arxiv

0+阅读 · 2022年6月17日

Evaluation of Contrastive Learning with Various Code Representations for Code Clone Detection

Arxiv

0+阅读 · 2022年6月17日

Bridge-Tower: Building Bridges Between Encoders in Vision-Language Representation Learning

Arxiv

0+阅读 · 2022年6月17日

Cross-Domain Few-Shot Graph Classification

Arxiv

13+阅读 · 2022年1月20日

A Survey on Data Augmentation for Text Classification

A Survey on Data Augmentation for Text Classification

Arxiv

16+阅读 · 2021年7月7日

Cross-Modal Discrete Representation Learning

Arxiv

18+阅读 · 2021年6月10日

Learning Latent Representations to Influence Multi-Agent Interaction

Arxiv

11+阅读 · 2020年11月12日

Interpretable CNNs for Object Classification

Interpretable CNNs for Object Classification

Arxiv

20+阅读 · 2020年3月12日

Graph Signal Processing -- Part II: Processing and Analyzing Signals on Graphs

Graph Signal Processing -- Part II: Processing and Analyzing Signals on Graphs

Arxiv

16+阅读 · 2019年9月23日

Graph Convolutional Networks for Text Classification

Arxiv

31+阅读 · 2018年11月13日

VIP会员

文章信息

相关主题

查全率/召回率

相关VIP内容

计算机科学课程与视频课件合集，Computer Science courses with video lectures

计算机科学课程与视频课件合集，Computer Science courses with video lectures

专知会员服务

37+阅读 · 2022年1月24日

NLP必读经典文献100篇

专知会员服务

124+阅读 · 2020年9月8日

100+篇《自监督学习(Self-Supervised Learning)》论文最新合集

100+篇《自监督学习(Self-Supervised Learning)》论文最新合集

专知会员服务

167+阅读 · 2020年3月18日

图像分类技巧集，17页ppt《Bag of Tricks for Image Classification》

图像分类技巧集，17页ppt《Bag of Tricks for Image Classification》

专知会员服务

96+阅读 · 2020年3月12日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

50+阅读 · 2019年10月17日

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

专知会员服务

59+阅读 · 2019年10月17日

强化学习最新教程，17页pdf

强化学习最新教程，17页pdf

专知会员服务

182+阅读 · 2019年10月11日

[综述]深度学习下的场景文本检测与识别

[综述]深度学习下的场景文本检测与识别

专知会员服务

78+阅读 · 2019年10月10日

机器学习入门的经验与建议

机器学习入门的经验与建议

专知会员服务

94+阅读 · 2019年10月10日

【加州大学伯克利分校博士论文】通过自我监督预测学习泛化

【加州大学伯克利分校博士论文】通过自我监督预测学习泛化

专知会员服务

65+阅读 · 2019年10月9日

热门VIP内容

开通专知VIP会员享更多权益服务

《俄乌战争中的无人系统：新的战争方式与新兴趋势——来自前线的印象》报告

《海上自主水面船舶远程操作中心：安全可持续运行的多维度分析》

多模态大语言模型下游调优中“保持自我”的重要性

隐身自主无人水下航行器技术如何变革水下作战并重塑海军竞争

相关资讯

VCIP 2022 Call for Demos

VCIP 2022 Call for Demos

CCF多媒体专委会

1+阅读 · 2022年6月6日

VCIP 2022 Call for Special Session Proposals

VCIP 2022 Call for Special Session Proposals

CCF多媒体专委会

1+阅读 · 2022年4月1日

IEEE ICKG 2022: Call for Papers

IEEE ICKG 2022: Call for Papers

机器学习与推荐算法

3+阅读 · 2022年3月30日

ACM MM 2022 Call for Papers

ACM MM 2022 Call for Papers

CCF多媒体专委会

5+阅读 · 2022年3月29日

IEEE TII Call For Papers

IEEE TII Call For Papers

CCF多媒体专委会

3+阅读 · 2022年3月24日

AIART 2022 Call for Papers

AIART 2022 Call for Papers

CCF多媒体专委会

1+阅读 · 2022年2月13日

Hierarchically Structured Meta-learning

Hierarchically Structured Meta-learning

CreateAMind

27+阅读 · 2019年5月22日

Transferring Knowledge across Learning Processes

Transferring Knowledge across Learning Processes

CreateAMind

29+阅读 · 2019年5月18日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

43+阅读 · 2019年1月3日

disentangled-representation-papers

disentangled-representation-papers

CreateAMind

26+阅读 · 2018年9月12日

相关论文

Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks

Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks

Arxiv

0+阅读 · 2022年6月17日

Evaluation of Contrastive Learning with Various Code Representations for Code Clone Detection

Arxiv

0+阅读 · 2022年6月17日

Bridge-Tower: Building Bridges Between Encoders in Vision-Language Representation Learning

Arxiv

0+阅读 · 2022年6月17日

Cross-Domain Few-Shot Graph Classification

Arxiv

13+阅读 · 2022年1月20日

A Survey on Data Augmentation for Text Classification

A Survey on Data Augmentation for Text Classification

Arxiv

16+阅读 · 2021年7月7日

Cross-Modal Discrete Representation Learning

Arxiv

18+阅读 · 2021年6月10日

Learning Latent Representations to Influence Multi-Agent Interaction

Arxiv

11+阅读 · 2020年11月12日

Interpretable CNNs for Object Classification

Interpretable CNNs for Object Classification

Arxiv

20+阅读 · 2020年3月12日

Graph Signal Processing -- Part II: Processing and Analyzing Signals on Graphs

Graph Signal Processing -- Part II: Processing and Analyzing Signals on Graphs

Arxiv

16+阅读 · 2019年9月23日

Graph Convolutional Networks for Text Classification

Arxiv

31+阅读 · 2018年11月13日

相关基金

CPU-GPU耦合架构下数据库连接技术研究

国家自然科学基金

0+阅读 · 2013年12月31日

面向工程设计的有限元软件系统架构研究

国家自然科学基金

0+阅读 · 2012年12月31日

航天器大规模集群飞行的分布式协调控制方法研究

国家自然科学基金

1+阅读 · 2012年12月31日

实时安全关键系统的建模、仿真与验证

国家自然科学基金

1+阅读 · 2012年12月31日

HER2/uPAR通路调控乳腺肿瘤休眠和细胞周期的分子机制

国家自然科学基金

0+阅读 · 2012年12月31日

基于Linked Open Data的Web服务语义互操作关键技术

国家自然科学基金

0+阅读 · 2012年12月31日

面向高性能混合计算的计算型存储器体系结构研究

国家自然科学基金

0+阅读 · 2009年12月31日

TR3相互作用新蛋白机理研究

国家自然科学基金

1+阅读 · 2008年12月31日

组合Web服务的建模与验证

国家自然科学基金

1+阅读 · 2008年12月31日

Lewis y抗原介导的PI3K/Akt2信号转导通路致卵巢癌多药耐药的分子机制

国家自然科学基金

0+阅读 · 2008年12月31日

微信扫码咨询专知VIP会员