Recent work learns contextual representations of source code by reconstructing tokens from their context. For downstream semantic understanding tasks like summarizing code in English, these representations should ideally capture program functionality. However, we show that the popular reconstruction-based BERT model is sensitive to source code edits, even when the edits preserve semantics. We propose ContraCode: a contrastive pre-training task that learns code functionality, not form. ContraCode pre-trains a neural network to identify functionally similar variants of a program among many non-equivalent distractors. We scalably generate these variants using an automated source-to-source compiler as a form of data augmentation. Contrastive pre-training improves JavaScript summarization and TypeScript type inference accuracy by 2% to 13%. We also propose a new zero-shot JavaScript code clone detection dataset, showing that ContraCode is both more robust and semantically meaningful. On it, we outperform RoBERTa by 39% AUROC in an adversarial setting and up to 5% on natural code.
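Below is a minimal sketch of the kind of contrastive objective the abstract describes: an InfoNCE-style loss that pulls together embeddings of a program and a compiler-generated, functionally equivalent variant while treating other programs in the batch as non-equivalent distractors. The encoder, tokenizer, and transform names are hypothetical illustrations, not ContraCode's exact implementation.

```python
# Hedged sketch of contrastive pre-training over compiler-augmented code,
# assuming an InfoNCE loss with in-batch negatives (PyTorch).
import torch
import torch.nn.functional as F

def info_nce_loss(anchor_emb, positive_emb, temperature=0.07):
    """anchor_emb[i] and positive_emb[i] embed two functionally equivalent
    variants of program i; every other row in the batch is a distractor."""
    anchor = F.normalize(anchor_emb, dim=1)
    positive = F.normalize(positive_emb, dim=1)
    logits = anchor @ positive.t() / temperature   # (B, B) similarity matrix
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)         # diagonal entries are the matching pairs

# Usage (hypothetical helpers): embed a program and a source-to-source
# transformed variant (e.g. after variable renaming or dead-code insertion),
# then minimize the loss so equivalent variants map to nearby embeddings.
# encoder = ...                                   # any code encoder producing fixed-size vectors
# z1 = encoder(tokenize(program))
# z2 = encoder(tokenize(transform(program)))      # automated data augmentation
# loss = info_nce_loss(z1, z2)
```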