Primitive types are fundamental components available in any programming language and serve as the building blocks of data manipulation. Understanding the role of these types in source code is essential for writing software. Little work has been conducted on how often variables of primitive types are documented in code comments and what types of knowledge those comments provide about them. In this paper, we present an approach for detecting primitive variables and their descriptions in comments using lexical matching and advanced matching. We evaluate the two approaches by comparing their recall, precision, and F-score against 600 manually annotated variables from a sample of GitHub projects. Measured by F-score, the advanced approach outperformed lexical matching (0.986 versus 0.942). We then create a taxonomy of the types of knowledge that these comments contain about variables of primitive types. Our study shows that developers document the purpose (69.16%) and concept (72.75%) of identifiers of numeric variables more often than those of identifiers of type String (purpose 61.14%, concept 55.46%). Our findings characterise the current state of practice in documenting primitive variables and point to areas that are often not well documented, such as the meaning of boolean variables or the purpose of fields and local variables.
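To make the evaluation setup concrete, the following is a minimal sketch of what a lexical match between a variable identifier and a comment could look like, together with the standard precision, recall, and F-score computation. It is an illustration under our own assumptions, not the paper's implementation: the helper names `split_identifier`, `lexical_match`, and `precision_recall_f1` are hypothetical, the rule that every word of the identifier must occur in the comment is assumed, and the advanced matching step is not covered.

```python
import re


def split_identifier(name):
    """Split a camelCase or snake_case identifier into lowercase words."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)
    return [p.lower() for p in parts]


def lexical_match(identifier, comment):
    """Assumed rule: every word of the identifier appears in the comment."""
    comment_words = set(re.findall(r"[a-z0-9]+", comment.lower()))
    id_words = split_identifier(identifier)
    return bool(id_words) and all(w in comment_words for w in id_words)


def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F-score from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Toy inputs chosen for illustration only; they are not data from the study.
    print(lexical_match("maxRetryCount", "The max retry count before giving up."))
    print(lexical_match("maxRetryCount", "Number of attempts allowed."))
    print(precision_recall_f1(tp=8, fp=2, fn=2))
```

A purely lexical rule like this one misses comments that describe a variable with synonyms or abbreviations, which is the kind of case a more advanced matcher would need to handle.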