Program code contains functions, variables, and data structures that are represented by names. To promote human understanding, these names should describe the role and use of the code elements they represent. But the names given by developers show high variability, reflecting the tastes of each developer, with different words used for the same meaning or the same words used for different meanings. This makes comparing names hard. A precise comparison should be based on matching identical words, but should also take into account possible variations on the words (including spelling and typing errors), reordering of the words, matching between synonyms, and so on. To facilitate this, we developed a library of comparison functions specifically targeted at comparing names in code. The different functions calculate the similarity between names in different ways, so a researcher can choose the one appropriate for their specific needs. All of them attempt to reflect human perceptions of similarity, possibly at the expense of lexical matching.
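To make the comparison criteria above concrete, the following is a minimal Python sketch of one way such a similarity function might look. It is not the library's actual API: the names `split_name`, `word_similarity`, `name_similarity`, and the tiny `SYNONYMS` table are hypothetical and introduced only for illustration. It splits names into words, maps a few synonyms to a common form, tolerates small spelling differences through the standard library's `difflib.SequenceMatcher`, and ignores word order.

```python
import re
from difflib import SequenceMatcher

# Hypothetical synonym table, for illustration only (not from the library).
SYNONYMS = {"count": "num", "number": "num", "idx": "index"}


def split_name(name):
    """Split a snake_case or camelCase identifier into lowercase words,
    replacing each word by its canonical synonym when one is listed."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)
    return [SYNONYMS.get(p.lower(), p.lower()) for p in parts]


def word_similarity(a, b):
    """Character-level similarity in [0, 1]; tolerant of small typos."""
    return SequenceMatcher(None, a, b).ratio()


def name_similarity(name1, name2):
    """Order-insensitive similarity between two names, in [0, 1].

    Each word is paired with its best-matching word in the other name,
    and the two directed averages are combined symmetrically.
    """
    words1, words2 = split_name(name1), split_name(name2)
    if not words1 or not words2:
        return 0.0

    def directed(src, dst):
        best = [max(word_similarity(s, d) for d in dst) for s in src]
        return sum(best) / len(src)

    return (directed(words1, words2) + directed(words2, words1)) / 2
```

Under these assumptions, `name_similarity("itemCount", "count_item")` treats the two names as equivalent despite the different word order and naming convention, while names with slightly different word forms, such as `numItems` and `item_count`, still score high. Averaging best matches is only one of many possible design choices; a function aimed at a different notion of similarity could weight words, penalize reordering, or use a richer synonym source.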