Program code is gaining popularity as a data source in the data science community. Possible applications for models trained on such assets range from classification for data dimensionality reduction to automatic code generation. Without annotation, however, the number of methods that can be applied is limited. To address the lack of annotated datasets, we present the Code4ML corpus. It contains code snippets, task summaries, and competition and dataset descriptions publicly available from Kaggle, the leading platform for hosting data science competitions. The corpus consists of ~2.5 million snippets of ML code collected from ~100 thousand Jupyter notebooks. A representative fraction of the snippets is annotated by human assessors through a user-friendly interface specially designed for that purpose. The Code4ML dataset can potentially help address a number of software engineering and data science challenges through a data-driven approach. For example, it can be helpful for semantic code classification, code auto-completion, and code generation for an ML task specified in natural language.