Machine Learning for Source Code (ML4Code) is an active research field in which extensive experimentation is needed to discover how best to use source code's richly structured information. With this in mind, we introduce JEMMA, an Extensible Java Dataset for ML4Code Applications, which is a large-scale, diverse, and high-quality dataset targeted at ML4Code. Our goal with JEMMA is to lower the barrier to entry in ML4Code by providing the building blocks to experiment with source code models and tasks. JEMMA comes with a considerable amount of pre-processed information, such as metadata, representations (e.g., code tokens, ASTs, graphs), and several properties (e.g., metrics, static analysis results) for 50,000 Java projects from the 50KC dataset, comprising over 1.2 million classes and over 8 million methods. JEMMA is also extensible, allowing users to add new properties and representations to the dataset and to evaluate tasks on them. Thus, JEMMA becomes a workbench that researchers can use to experiment with novel representations and tasks operating on source code. To demonstrate the utility of the dataset, we also report results from two empirical studies on our data, ultimately showing that significant work lies ahead in the design of context-aware source code models that can reason over a broader network of source code entities in a software project, the very task that JEMMA is designed to help with.