The relationship between comments and code, and in particular the task of generating useful comments from code, has long been of interest. The earliest approaches were based on strong syntactic theories of comment structure and relied on textual templates. More recently, researchers have applied deep learning methods to this task, specifically trainable generative translation models known to work very well for natural-language translation (e.g., from German to English). We carefully examine the underlying assumption here: that the task of generating comments sufficiently resembles the task of translating between natural languages, so that similar models and evaluation metrics can be used. We analyze several recent code-comment datasets for this task: CodeNN, DeepCom, FunCom, and DocString. We compare them with WMT19, a standard dataset frequently used to train state-of-the-art natural-language translators, and find some interesting differences between the code-comment data and the WMT19 natural-language data. Next, we describe and conduct studies to calibrate BLEU (which is commonly used as a measure of comment quality) using "affinity pairs" of methods drawn from different projects, from the same project, from the same class, and so on. Our study suggests that the current performance on some datasets might need to be improved substantially. We also argue that fairly naive information-retrieval (IR) methods do well enough at this task to be considered a reasonable baseline. Finally, we make some suggestions on how our findings might be used in future research in this area.
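The affinity-pair calibration described above hinges on computing sentence-level BLEU between pairs of comments. The following is a minimal self-contained sketch of that idea, using a simplified BLEU (modified n-gram precision with add-one smoothing and a brevity penalty), not the exact scorer used in the study; the example comment strings are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bleu(reference, candidate, max_n=4):
    """Simplified sentence-level BLEU-4 with add-one smoothing.

    Geometric mean of clipped n-gram precisions for n = 1..max_n,
    scaled by a brevity penalty when the candidate is shorter
    than the reference.
    """
    ref, cand = reference.split(), candidate.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # clip each candidate n-gram count by its count in the reference
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        # add-one smoothing so a single missing n-gram order
        # does not zero out the whole score
        log_prec += math.log((clipped + 1) / (total + 1)) / max_n
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(log_prec)


# Hypothetical "affinity pair" comparison: comments for two methods in
# the same class should score higher than comments for methods taken
# from unrelated projects.
target = "returns the maximum value in the list"
same_class = bleu(target, "returns the minimum value in the list")
cross_project = bleu(target, "closes the database connection")
```

With this sketch, `same_class` comes out well above `cross_project`, mirroring the calibration intuition: BLEU between comments of related methods sets a yardstick against which generated-comment scores can be judged.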