Comments are an important part of source code and a primary source of documentation. This has driven interest in using large bodies of comments to train or evaluate tools that consume or produce them, such as tools that generate oracles or even code from comments, or that automatically generate code summaries. Most of this work makes strong assumptions about the structure and quality of comments, such as assuming they consist mostly of proper English sentences. However, we know little about the actual quality of existing comments for these use cases. Comments often contain unique structures and elements that are not seen in other types of text, and filtering or extracting information from them requires some extra care. This paper explores the contents and quality of Python comments drawn from the 840 most popular open source projects on GitHub and 8422 projects from the SriLab dataset, and the impact that naïve vs. in-depth filtering can have on the use of existing comments for training and evaluation of systems that generate comments.