Source code summarization of a subroutine is the task of writing a short, natural language description of that subroutine. The description usually serves in documentation aimed at programmers, where even a brief phrase (e.g. "compresses data to a zip file") can help readers rapidly comprehend what a subroutine does without resorting to reading the code itself. Techniques based on neural networks (and encoder-decoder model designs in particular) have established themselves as the state of the art. Yet a widely recognized problem with these models is that they assume the information needed to create a summary is present within the code being summarized, an assumption at odds with the program comprehension literature. Thus a current research frontier lies in the question of encoding source code context into neural models of summarization. In this paper, we present a project-level encoder to improve models of code summarization. By project-level, we mean that we create a vectorized representation of selected code files in a software project, and use that representation to augment the encoder of state-of-the-art neural code summarization techniques. We demonstrate how our encoder improves several existing models, and provide guidelines for maximizing improvement while controlling the time and resource costs of model size.
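The core idea above, encoding selected project files into a vector that augments the subroutine encoding, can be illustrated with a minimal sketch. This is an illustrative toy, not the authors' implementation: a hash-based token embedding stands in for a learned embedding layer, and simple mean pooling stands in for a trained encoder. All function names here are hypothetical.

```python
# Toy sketch of a project-level encoder (assumption: not the paper's actual
# model). Selected project files are each embedded into a fixed-size vector,
# pooled into one project vector, and concatenated onto the subroutine
# encoding that a seq2seq summarizer would otherwise use alone.
import hashlib

DIM = 8  # toy embedding dimensionality

def embed_tokens(tokens):
    """Hash each token into a DIM-dim vector and mean-pool
    (a stand-in for a learned embedding layer)."""
    vec = [0.0] * DIM
    for tok in tokens:
        digest = hashlib.md5(tok.encode()).digest()
        for i in range(DIM):
            vec[i] += digest[i] / 255.0
    n = max(len(tokens), 1)
    return [v / n for v in vec]

def project_encoding(project_files):
    """Encode each selected file, then mean-pool across files
    into a single project-level vector."""
    file_vecs = [embed_tokens(src.split()) for src in project_files]
    return [sum(col) / len(file_vecs) for col in zip(*file_vecs)]

def augmented_encoding(subroutine_src, project_files):
    """Concatenate the subroutine encoding with the project vector;
    in a real model the combined representation would feed the
    decoder's attention mechanism."""
    return embed_tokens(subroutine_src.split()) + project_encoding(project_files)

enc = augmented_encoding("def zip_data(path): ...",
                         ["import zlib", "class Archive: pass"])
print(len(enc))  # 2 * DIM = 16
```

In the real setting, the pooling and embedding would be learned jointly with the summarization model, and file selection (which project files to encode) is one of the cost/benefit knobs the paper's guidelines address.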