Identifiers, such as method and variable names, make up a large portion of source code, so low-quality identifiers can substantially hinder code comprehension. To help developers choose meaningful identifiers, several (semi-)automatic techniques have been proposed; most are data-driven (e.g., statistical language models, deep learning models) or rely on static code analysis. Still, few empirical studies have investigated how effective such techniques are at recommending meaningful identifiers to developers, recommendations that could in turn trigger rename refactoring operations. We present a large-scale study investigating the potential of data-driven approaches to support automated variable renaming. We experiment with three state-of-the-art techniques: a statistical language model and two deep learning (DL)-based models. The three approaches were trained and tested on three datasets that we built to evaluate their ability to recommend meaningful variable identifiers. Our quantitative and qualitative analyses show the potential of these techniques, which, under specific conditions, can provide valuable recommendations and are ready to be integrated into rename refactoring tools. Nonetheless, our results also highlight limitations of the evaluated approaches that call for further research in this field.