Reimplementing solutions to previously solved software engineering problems is not only inefficient but also prone to introducing inadequate, error-prone code. Many existing methods address this problem with impressive performance by using autoregressive text-generation models trained on code. However, these methods are not without flaws: the generated code can be buggy, lack documentation, and introduce vulnerabilities that may go unnoticed by developers. An alternative to code generation -- neural code search -- is a field of machine learning in which a model takes a natural language query as input and returns relevant code samples from a database. Because this database exists ahead of time, its code samples can be documented, tested, licensed, and checked for vulnerabilities before developers use them in production. In this work, we present CodeDSI, an end-to-end unified approach to code search. CodeDSI is trained to directly map natural language queries to the docids of their respective code samples, which can then be retrieved. To improve code search performance, we investigate docid representation strategies, the impact of tokenization on docid structure, and the effect of dataset size on overall code search performance. Our results demonstrate CodeDSI's strong performance, exceeding conventional robust baselines by 2-6% across varying dataset sizes.
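To make the retrieval setup concrete, the sketch below illustrates the general differentiable-search-index (DSI) pattern the abstract describes: a seq2seq model decodes a docid string directly from a natural language query, and the docid then keys into a pre-indexed code corpus. This is a minimal illustration only; the base checkpoint `t5-base` stands in for a fine-tuned CodeDSI model, and the toy docid scheme and corpus here are hypothetical, not the paper's actual setup.

```python
# Minimal DSI-style retrieval sketch: generate a docid from a query,
# then look the docid up in a pre-existing, vetted code corpus.
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Stand-in for a fine-tuned CodeDSI checkpoint (hypothetical; untrained here).
tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

# Toy docid -> code-sample index standing in for the documented, tested,
# licensed corpus the abstract describes.
corpus = {
    "4217": "def binary_search(arr, target): ...",
    "0093": "def quicksort(arr): ...",
}

query = "find an element in a sorted list"
inputs = tokenizer(query, return_tensors="pt")

# The model decodes the docid token by token; beam search could return
# a ranked list of candidate docids instead of a single one.
outputs = model.generate(**inputs, max_new_tokens=8, num_beams=4)
docid = tokenizer.decode(outputs[0], skip_special_tokens=True).strip()

# Retrieval reduces to a plain lookup once the docid is generated.
print(corpus.get(docid, "docid not in index (model is untrained in this sketch)"))
```

A design note on this pattern: because retrieval ends in a lookup against a curated database rather than free-form generation, every returned sample can be vetted (documented, tested, licensed, vulnerability-checked) before it ever reaches a developer, which is the core advantage the abstract claims over autoregressive code generation.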