Both professional developers and teachers frequently deal with imperfect (fragmentary, incomplete, or ill-formed) code. Such fragments are common on Stack Overflow; students also frequently produce ill-formed code, for which instructors, TAs, or the students themselves must find repairs. In either case, the developer experience could be greatly improved if such code could be parsed and typed: this would make the code more amenable to use within IDEs and would allow early detection and repair of potential errors. We introduce a lenient parser, which can parse and type fragments, even ones with simple errors. Training a machine learner to leniently parse and type imperfect code requires a large training set of imperfect code paired with its repairs; such training sets are limited by the human effort and curation they require. In this paper, we present a novel, indirectly supervised approach to training a lenient parser without access to such human-curated training data. We leverage the huge corpus of mostly correct code available on GitHub and the massive, efficient learning capacity of Transformer-based neural network architectures. Using GitHub data, we first create a large dataset of code fragments with their corresponding tree fragments and type annotations; we then randomly corrupt the input fragments by seeding errors that mimic the corruptions found in Stack Overflow and student data. Using this data, we train high-capacity Transformer models to overcome both fragmentation and corruption. With this approach, we achieve reasonable performance on parsing and typing Stack Overflow fragments; we also demonstrate that our approach performs well on shorter erroneous student programs and achieves best-in-class performance on longer programs of more than 400 tokens. Finally, we show that by blending DeepFix with our tool, we can achieve 77% accuracy, which outperforms all previously reported student error correction tools.
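The data-generation idea described above (take correct code, cut out a fragment, then seed errors into it) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function names, the whitespace tokenization, the `DELIMS` set, and the single "drop a delimiter" corruption are hypothetical and are not taken from the paper, whose actual error model is not specified in this abstract. In the described approach the training target would be the tree fragment and type annotations; here the clean token list stands in for that target.

```python
import random

# Hypothetical set of tokens that the corruption step is allowed to drop.
DELIMS = {"(", ")", "{", "}", ";"}

def extract_fragment(tokens, rng, min_len=3):
    """Pick a random contiguous slice of the token list (a 'fragment')."""
    if len(tokens) <= min_len:
        return list(tokens)
    start = rng.randrange(0, len(tokens) - min_len + 1)
    end = rng.randrange(start + min_len, len(tokens) + 1)
    return tokens[start:end]

def corrupt(fragment, rng, p_drop=0.3):
    """Seed simple errors: drop each delimiter token with probability p_drop."""
    return [t for t in fragment
            if not (t in DELIMS and rng.random() < p_drop)]

def make_pair(tokens, seed=0):
    """Return (noisy_input, clean_fragment) for one training example."""
    rng = random.Random(seed)  # seeded so the example is reproducible
    clean = extract_fragment(tokens, rng)
    return corrupt(clean, rng), clean

if __name__ == "__main__":
    source = "int x = foo ( a , b ) ; return x ;".split()
    noisy, clean = make_pair(source, seed=1)
    print("clean:", " ".join(clean))
    print("noisy:", " ".join(noisy))
```

Because the pairs are generated automatically from correct code, no human-curated set of broken code and repairs is needed; a realistic pipeline would presumably use several corruption operators tuned to the errors observed in the target data.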