Developers use search for a variety of tasks, such as finding code, documentation, and debugging information. In particular, developers rely heavily on web search to find code examples and snippets while coding. Natural-language code search has recently become an active area of research, but the lack of large-scale, real-world datasets is a significant bottleneck. In this work, we propose a weak-supervision-based approach for detecting code search intent in search queries for the C# and Java programming languages. We evaluate the approach against several baselines on a real-world dataset of over 1 million queries mined from the Bing web search engine and show that the CNN-based model achieves an accuracy of 77% for C# and 76% for Java. Furthermore, we are releasing Search4Code, the first large-scale, real-world dataset of code search queries mined from the Bing web search engine. We hope that the dataset will aid future research on code search.
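To make the idea of weak supervision for query intent concrete, the following is a minimal sketch, not the method used in this work. It assumes a handful of hypothetical keyword-based labeling functions (`lf_how_to`, `lf_example`, `lf_non_code`) whose votes are combined by simple majority to give a noisy label for each query. The rules, names, and keywords are illustrative assumptions only.

```python
import re

# Label values used by this sketch (assumed convention, not from the paper).
CODE_INTENT, NOT_CODE, ABSTAIN = 1, 0, -1


def lf_how_to(query):
    # Hypothetical rule: "how to ... c#/java" phrasing suggests code search intent.
    return CODE_INTENT if re.search(r"\bhow to\b.*(c#|\bjava\b)", query) else ABSTAIN


def lf_example(query):
    # Hypothetical rule: asking for an example or snippet suggests code search intent.
    return CODE_INTENT if re.search(r"\b(example|snippet|sample code)\b", query) else ABSTAIN


def lf_non_code(query):
    # Hypothetical rule: download/install/career queries are unlikely to seek code.
    return NOT_CODE if re.search(r"\b(download|install|interview|salary)\b", query) else ABSTAIN


LABELING_FUNCTIONS = [lf_how_to, lf_example, lf_non_code]


def weak_label(query):
    """Combine labeling-function votes by majority; abstain on ties or no votes."""
    q = query.lower()
    votes = [lf(q) for lf in LABELING_FUNCTIONS]
    code_votes = votes.count(CODE_INTENT)
    non_code_votes = votes.count(NOT_CODE)
    if code_votes == non_code_votes:
        return ABSTAIN
    return CODE_INTENT if code_votes > non_code_votes else NOT_CODE


if __name__ == "__main__":
    for q in ["how to read a file in java example", "java jdk download", "c# string"]:
        print(q, "->", weak_label(q))
```

Queries that receive a non-abstaining label could then serve as noisy training data for a downstream classifier, such as a CNN over query tokens. How the labels are actually generated and aggregated in this work may differ from this sketch.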