Training deep learning models on source code has gained significant traction recently. Because such models operate on vectors of numbers, source code must first be converted into a code representation, which is then transformed into vectors. Numerous approaches have been proposed for representing source code, from sequences of tokens to abstract syntax trees. However, no systematic study has examined the effect of code representation on learning performance. Through a controlled experiment, we examine the impact of various code representations on model accuracy and usefulness in learning-based program repair. We train 21 different models, covering 14 homogeneous code representations, four mixed representations for the buggy and fixed code, and three embeddings. We also conduct a user study to qualitatively evaluate the usefulness of inferred fixes in different code representations. Our results highlight the importance of code representation and its impact on learning and usefulness. Our findings indicate that (1) while code abstractions help the learning process, they can adversely affect the usefulness of inferred fixes from a developer's point of view, which emphasizes the need to examine generated patches from the practitioner's perspective, an aspect often neglected in the literature; (2) mixed representations can outperform homogeneous code representations; and (3) bug type can affect the effectiveness of different code representations: although current techniques use a single code representation for all bug types, no single code representation is best for all bug types.
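To make the notion of a code representation concrete, the following minimal sketch derives three representations of the same statement: a raw token sequence, an abstracted token sequence in which identifiers are replaced by placeholders, and a pre-order listing of abstract syntax tree node types. It is an illustration only, not the pipeline used in the study. The example statement, the `ID_n` placeholder scheme, the function names, and the use of Python's standard `tokenize` and `ast` modules are all assumptions made here for demonstration.

```python
import ast
import io
import keyword
import tokenize

# Hypothetical statement, chosen only for illustration.
SOURCE = "total = price * count\n"

# Token types that carry layout rather than program content.
_LAYOUT = {tokenize.NEWLINE, tokenize.NL, tokenize.ENDMARKER,
           tokenize.INDENT, tokenize.DEDENT, tokenize.COMMENT}


def _lexical_tokens(source):
    """Yield the content-bearing tokens of a source string."""
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type not in _LAYOUT:
            yield tok


def token_sequence(source):
    """Representation 1: the flat sequence of token strings."""
    return [tok.string for tok in _lexical_tokens(source)]


def abstracted_sequence(source):
    """Representation 2: tokens with each distinct identifier renamed.

    The ID_n naming is an assumed abstraction scheme, not one taken
    from the study.
    """
    names = {}
    result = []
    for tok in _lexical_tokens(source):
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            result.append(names.setdefault(tok.string, f"ID_{len(names) + 1}"))
        else:
            result.append(tok.string)
    return result


def ast_node_types(source):
    """Representation 3: AST node type names in pre-order."""
    def visit(node):
        yield type(node).__name__
        for child in ast.iter_child_nodes(node):
            yield from visit(child)
    return list(visit(ast.parse(source)))


print(token_sequence(SOURCE))       # ['total', '=', 'price', '*', 'count']
print(abstracted_sequence(SOURCE))  # ['ID_1', '=', 'ID_2', '*', 'ID_3']
print(ast_node_types(SOURCE))
```

The abstracted sequence shrinks the vocabulary a model has to learn, but it also hides the original identifier names, which is one plausible reason an abstracted fix could be harder for a developer to read.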