Code4ML: a Large-scale Dataset of annotated Machine Learning Code


Code · Dataset · Machine Learning · Learning · ML

October 28, 2022


Anastasia Drozdova, Polina Guseva, Ekaterina Trofimova, Anna Scherbakova, Andrey Ustyuzhanin

From arXiv; under review

Program code as a data source is gaining popularity in the data science community. Possible applications for models trained on such assets range from classification for data dimensionality reduction to automatic code generation. However, without annotation, the number of methods that can be applied is somewhat limited. To address the lack of annotated datasets, we present the Code4ML corpus. It contains code snippets, task summaries, and competition and dataset descriptions publicly available from Kaggle, the leading platform for hosting data science competitions. The corpus consists of ~2.5 million snippets of ML code collected from ~100 thousand Jupyter notebooks. A representative fraction of the snippets is annotated by human assessors through a user-friendly interface specially designed for that purpose. The Code4ML dataset can potentially help address a number of software engineering and data science challenges through a data-driven approach. For example, it can be helpful for semantic code classification, code auto-completion, and code generation for an ML task specified in natural language.

