As code search permeates most activities in software development, code-to-code search has emerged to support using code as a query and retrieving similar code in the search results. Applications include duplicate code detection for refactoring, patch identification for program repair, and language translation. Existing code-to-code search tools rely on static similarity approaches, such as the comparison of tokens and abstract syntax trees (ASTs), to approximate dynamic behavior, leading to low precision. Most tools do not support cross-language code-to-code search, and those that do rely on machine learning models that require labeled training data. We present Code-to-Code Search Across Languages (COSAL), a cross-language technique that uses both static and dynamic analyses to identify similar code and does not require a machine learning model. Code snippets are ranked using non-dominated sorting based on code token similarity, structural similarity, and behavioral similarity. We empirically evaluate COSAL on two datasets, one of 43,146 Java and Python files and one of 55,499 Java files, and find that 1) code search based on non-dominated ranking of static and dynamic similarity measures is more effective than search based on single or weighted measures; and 2) COSAL has better precision and recall than state-of-the-art within-language and cross-language code-to-code search tools. We explore the potential for using COSAL on large open-source repositories and discuss scalability to more languages and similarity metrics, providing a gateway for practical, multi-language code-to-code search.
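To make the ranking step concrete, the following is a minimal sketch of non-dominated sorting over three similarity scores. It is not COSAL's implementation: the function names, the score tuple layout, the file names, and the convention that higher scores mean more similar are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical score layout: (token, structural, behavioral) similarity.
# Assumption: higher means more similar to the query on each measure.
Scores = Tuple[float, float, float]


def dominates(a: Scores, b: Scores) -> bool:
    """True if a is at least as good as b on every measure and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def non_dominated_fronts(candidates: Dict[str, Scores]) -> List[List[str]]:
    """Peel off successive Pareto fronts; front 0 holds the top-ranked snippets."""
    remaining = dict(candidates)
    fronts: List[List[str]] = []
    while remaining:
        # A candidate is in the current front if no other remaining candidate dominates it.
        front = sorted(
            name
            for name, s in remaining.items()
            if not any(dominates(t, s) for other, t in remaining.items() if other != name)
        )
        fronts.append(front)
        for name in front:
            del remaining[name]
    return fronts


# Made-up scores for four hypothetical search results.
candidates = {
    "A.java": (0.9, 0.8, 0.7),
    "b.py": (0.6, 0.9, 0.8),
    "C.java": (0.5, 0.5, 0.5),
    "d.py": (0.2, 0.4, 0.1),
}
fronts = non_dominated_fronts(candidates)
print(fronts)
```

In this sketch, two results share a front when neither is better on all three measures, so no weights are needed to trade one measure against another. That is the property that separates a non-dominated ranking from a weighted sum of the same scores.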