Semantic code search is the task of finding semantically relevant code snippets for a given natural language query. In state-of-the-art approaches, the semantic similarity between code and query is quantified as the distance between their representations in a shared vector space. In this paper, to improve the vector space, we introduce tree-serialization methods on a simplified form of the abstract syntax tree (AST) and build a multimodal representation of the code data. We conduct extensive experiments on CodeSearchNet, a single large-scale, multi-language corpus. Our results show that both our tree-serialized representations and our multimodal learning model improve the performance of code search. Finally, we define intuitive quantitative metrics for the completeness of the semantic and syntactic information in the code data, to help explain the experimental findings.