Semantic code search is the task of retrieving code snippets that are semantically relevant to a natural language query. State-of-the-art approaches quantify the semantic similarity between code and query as the distance between their representations in a shared vector space. In this paper, to improve that vector space, we introduce tree-serialization methods on a simplified form of the abstract syntax tree (AST) and build a multimodal representation of the code data. We conduct extensive experiments on a single large-scale, multi-language corpus: CodeSearchNet. Our results show that both our tree-serialized representations and our multimodal learning model improve the performance of neural code search. Finally, we define two intuitive quantification metrics oriented to the completeness of the semantic and syntactic information in the code data.
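To make the idea of serializing a simplified AST concrete, the following is a minimal sketch, not the method evaluated in the paper. It assumes Python source parsed with the standard `ast` module, a simplification that only drops load/store context nodes, and a bracketed pre-order traversal. The helper names `simplify` and `serialize` are hypothetical, and the simplification rules and serialization schemes actually used may differ.

```python
import ast


def simplify(node):
    """Reduce an AST node to a (label, children) pair.

    Assumption: "simplified" here means keeping node types, identifiers and
    constants, and dropping expression-context nodes (Load/Store/Del).
    """
    label = type(node).__name__
    if isinstance(node, ast.FunctionDef):
        label = f"FunctionDef:{node.name}"
    elif isinstance(node, ast.Name):
        label = f"Name:{node.id}"
    elif isinstance(node, ast.arg):
        label = f"arg:{node.arg}"
    elif isinstance(node, ast.Constant):
        label = f"Constant:{node.value!r}"
    children = [
        simplify(child)
        for child in ast.iter_child_nodes(node)
        if not isinstance(child, ast.expr_context)
    ]
    return label, children


def serialize(tree):
    """Flatten a (label, children) tree into a token list.

    Pre-order traversal; subtrees with children are wrapped in parentheses
    so that the tree structure can be recovered from the flat sequence.
    """
    label, children = tree
    if not children:
        return [label]
    tokens = ["(", label]
    for child in children:
        tokens.extend(serialize(child))
    tokens.append(")")
    return tokens


source = "def add(a, b):\n    return a + b\n"
tokens = serialize(simplify(ast.parse(source)))
print(" ".join(tokens))
```

The resulting token sequence can be fed to a sequence encoder alongside the raw code tokens, which is one plausible way to obtain the kind of multimodal code representation described above.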