The source code of successful projects is evolving all the time, resulting in hundreds of thousands of code changes stored in source code repositories. This wealth of data can be useful, e.g., to find changes similar to a planned code change or examples of recurring code improvements. This paper presents DiffSearch, a search engine that, given a query that describes a code change, returns a set of changes that match the query. The approach is enabled by three key contributions. First, we present a query language that extends the underlying programming language with wildcards and placeholders, providing an intuitive way of formulating queries that is easy to adapt to different programming languages. Second, to ensure scalability, the approach indexes code changes in a one-time preprocessing step, mapping them into a feature space, and then performs an efficient search in the feature space for each query. Third, to guarantee precision, i.e., that any returned code change indeed matches the given query, we present a tree-based matching algorithm that checks whether a query can be expanded to a concrete code change. We present implementations for Java, JavaScript, and Python, and show that the approach responds within seconds to queries across one million code changes, has a recall of 80.7% for Java, 89.6% for Python, and 90.4% for JavaScript, enables users to find relevant code changes more effectively than a regular expression-based search, and is helpful for gathering a large-scale dataset of real-world bug fixes.
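To make the index-then-verify idea concrete, the following is a minimal sketch and not DiffSearch's implementation. It swaps in simpler techniques throughout: an inverted index over tokens instead of a feature space, and a flat token-by-token matcher instead of the tree-based matching algorithm. The placeholder syntax (any query token starting with `ANY`, matching exactly one token and binding consistently across both sides of the change) and all names such as `ChangeIndex` are assumptions made for illustration only.

```python
import re
from collections import defaultdict

# Crude tokenizer (assumption): identifiers, integers, or single symbols.
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def tokenize(code):
    return TOKEN.findall(code)


def is_placeholder(tok):
    # Hypothetical placeholder syntax: ANY, ANY_1, ANY_F, ...
    return tok.startswith("ANY")


def side_features(side, tokens, skip_placeholders):
    # A feature records that a concrete token occurs on one side of a change.
    return {(side, t) for t in tokens
            if not (skip_placeholders and is_placeholder(t))}


def match_side(query_toks, code_toks, bindings):
    # Exact check: same length, equal tokens, consistent placeholder bindings.
    if len(query_toks) != len(code_toks):
        return False
    for q, c in zip(query_toks, code_toks):
        if is_placeholder(q):
            if bindings.setdefault(q, c) != c:
                return False
        elif q != c:
            return False
    return True


class ChangeIndex:
    def __init__(self):
        self.changes = []                 # (old_src, new_src, old_toks, new_toks)
        self.postings = defaultdict(set)  # feature -> ids of changes having it

    def add(self, old_src, new_src):
        # One-time preprocessing: tokenize the change and index its features.
        cid = len(self.changes)
        old_toks, new_toks = tokenize(old_src), tokenize(new_src)
        self.changes.append((old_src, new_src, old_toks, new_toks))
        feats = (side_features("old", old_toks, False)
                 | side_features("new", new_toks, False))
        for f in feats:
            self.postings[f].add(cid)

    def search(self, old_query, new_query):
        q_old, q_new = tokenize(old_query), tokenize(new_query)
        feats = (side_features("old", q_old, True)
                 | side_features("new", q_new, True))
        # Stage 1: cheap filter, keep changes containing every concrete token.
        if feats:
            candidates = set.intersection(
                *(self.postings.get(f, set()) for f in feats))
        else:
            candidates = set(range(len(self.changes)))
        # Stage 2: exact verification, so every returned change truly matches.
        results = []
        for cid in sorted(candidates):
            old_src, new_src, old_toks, new_toks = self.changes[cid]
            bindings = {}
            if (match_side(q_old, old_toks, bindings)
                    and match_side(q_new, new_toks, bindings)):
                results.append((old_src, new_src))
        return results


if __name__ == "__main__":
    index = ChangeIndex()
    index.add("foo(a, b);", "foo(b, a);")
    index.add("bar(a, b);", "bar(a, b, c);")
    # Query for "two call arguments were swapped".
    print(index.search("ANY_F(ANY_1, ANY_2);", "ANY_F(ANY_2, ANY_1);"))
```

The split mirrors the division of labor described in the abstract: the first stage only narrows the candidate set, while the second stage is what guarantees that a returned change matches the query. Because this sketch's filter is exact, it cannot miss a match, whereas the abstract reports recall below 100% for the actual feature-space search.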