The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation - 专知论文

会员服务 ·

0

可理解性 · 数据集 · 泛函 · 代码 · 语言模型化 ·

2023 年 5 月 9 日

The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation

翻译：暂无翻译

Dung Nguyen Manh,Nam Le Hai,Anh T. V. Dau,Anh Minh Nguyen,Khanh Nghiem,Jin Guo,Nghi D. Q. Bui

We present The Vault, an open-source, large-scale code-text dataset designed to enhance the training of code-focused large language models (LLMs). Existing open-source datasets for training code-based LLMs often face challenges in terms of size, quality (due to noisy signals), and format (only containing code function and text explanation pairings). The Vault overcomes these limitations by providing 40 million code-text pairs across 10 popular programming languages, thorough cleaning for 10+ prevalent issues, and various levels of code-text pairings, including class, function, and line levels. Researchers and practitioners can utilize The Vault for training diverse code-focused LLMs or incorporate the provided data cleaning methods and scripts to improve their datasets. By employing The Vault as the training dataset for code-centric LLMs, we anticipate significant advancements in code understanding and generation tasks, fostering progress in both artificial intelligence research and software development practices.

翻译：暂无翻译

0

相关内容

可理解性

【2022新书】高效深度学习，Efficient Deep Learning Book

【2022新书】高效深度学习，Efficient Deep Learning Book

专知会员服务

126+阅读 · 2022年4月21日

100+篇《自监督学习(Self-Supervised Learning)》论文最新合集

100+篇《自监督学习(Self-Supervised Learning)》论文最新合集

专知会员服务

167+阅读 · 2020年3月18日

【跨语言BERT模型大集合】Transfer learning is increasingly going multilingual with language-specific BERT models

专知会员服务

54+阅读 · 2020年1月30日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

50+阅读 · 2019年10月17日

【SIGGRAPH2019】TensorFlow 2.0深度学习计算机图形学应用

【SIGGRAPH2019】TensorFlow 2.0深度学习计算机图形学应用

专知会员服务

41+阅读 · 2019年10月9日

Hierarchically Structured Meta-learning

Hierarchically Structured Meta-learning

CreateAMind

27+阅读 · 2019年5月22日

Transferring Knowledge across Learning Processes

Transferring Knowledge across Learning Processes

CreateAMind

29+阅读 · 2019年5月18日

无监督元学习表示学习

无监督元学习表示学习

CreateAMind

27+阅读 · 2019年1月4日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

44+阅读 · 2019年1月3日

A Technical Overview of AI & ML in 2018 & Trends for 2019

A Technical Overview of AI & ML in 2018 & Trends for 2019

待字闺中

18+阅读 · 2018年12月24日

DR5调节JNK通路及Bid磷酸化在结肠癌耐药中的分子作用机制及临床意义

国家自然科学基金

0+阅读 · 2015年12月31日

Anderson型多酸的不对称修饰及可控组装研究

国家自然科学基金

1+阅读 · 2014年12月31日

关系的分解与Domain的表示

国家自然科学基金

1+阅读 · 2011年12月31日

CIECAM02拓展研究

国家自然科学基金

0+阅读 · 2011年12月31日

基于几何约束lifting技术的细分小波变换研究

国家自然科学基金

0+阅读 · 2009年12月31日

DAT: Data Architecture Modeling Tool for Data-Driven Applications

Arxiv

0+阅读 · 2023年6月22日

A Comprehensive Survey on Source-free Domain Adaptation

Arxiv

10+阅读 · 2023年2月23日

Large-scale Multi-Modal Pre-trained Models: A Comprehensive Survey

Arxiv

25+阅读 · 2023年2月20日

Artificial Intelligence for the Metaverse: A Survey

Arxiv

31+阅读 · 2022年2月15日

A Comprehensive Survey on Transfer Learning

A Comprehensive Survey on Transfer Learning

Arxiv

121+阅读 · 2019年11月7日

VIP会员

文章信息

相关主题

语言模型化

相关VIP内容

【2022新书】高效深度学习，Efficient Deep Learning Book

【2022新书】高效深度学习，Efficient Deep Learning Book

专知会员服务

126+阅读 · 2022年4月21日

100+篇《自监督学习(Self-Supervised Learning)》论文最新合集

100+篇《自监督学习(Self-Supervised Learning)》论文最新合集

专知会员服务

167+阅读 · 2020年3月18日

【跨语言BERT模型大集合】Transfer learning is increasingly going multilingual with language-specific BERT models

专知会员服务

54+阅读 · 2020年1月30日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

50+阅读 · 2019年10月17日

【SIGGRAPH2019】TensorFlow 2.0深度学习计算机图形学应用

【SIGGRAPH2019】TensorFlow 2.0深度学习计算机图形学应用

专知会员服务

41+阅读 · 2019年10月9日

热门VIP内容

开通专知VIP会员享更多权益服务

《利用人工智能改善军事警察行动：当下现状探索》最新95页报告

《用于适应性、任务就绪型军用仿生机器人的合成数据管道》

面向现代武装力量的高级AI驱动军事模拟与训练软件

《军事应用中的AI：建立信任》最新报告

相关资讯

Hierarchically Structured Meta-learning

Hierarchically Structured Meta-learning

CreateAMind

27+阅读 · 2019年5月22日

Transferring Knowledge across Learning Processes

Transferring Knowledge across Learning Processes

CreateAMind

29+阅读 · 2019年5月18日

无监督元学习表示学习

无监督元学习表示学习

CreateAMind

27+阅读 · 2019年1月4日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

44+阅读 · 2019年1月3日

A Technical Overview of AI & ML in 2018 & Trends for 2019

A Technical Overview of AI & ML in 2018 & Trends for 2019

待字闺中

18+阅读 · 2018年12月24日

相关论文

DAT: Data Architecture Modeling Tool for Data-Driven Applications

Arxiv

0+阅读 · 2023年6月22日

A Comprehensive Survey on Source-free Domain Adaptation

Arxiv

10+阅读 · 2023年2月23日

Large-scale Multi-Modal Pre-trained Models: A Comprehensive Survey

Arxiv

25+阅读 · 2023年2月20日

Artificial Intelligence for the Metaverse: A Survey

Arxiv

31+阅读 · 2022年2月15日

A Comprehensive Survey on Transfer Learning

A Comprehensive Survey on Transfer Learning

Arxiv

121+阅读 · 2019年11月7日

相关基金

DR5调节JNK通路及Bid磷酸化在结肠癌耐药中的分子作用机制及临床意义

国家自然科学基金

0+阅读 · 2015年12月31日

Anderson型多酸的不对称修饰及可控组装研究

国家自然科学基金

1+阅读 · 2014年12月31日

关系的分解与Domain的表示

国家自然科学基金

1+阅读 · 2011年12月31日

CIECAM02拓展研究

国家自然科学基金

0+阅读 · 2011年12月31日

基于几何约束lifting技术的细分小波变换研究

国家自然科学基金

0+阅读 · 2009年12月31日

微信扫码咨询专知VIP会员