RepoCoder：通过迭代检索和生成实现代码库级代码完成 (RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and Generation) - 专知论文

会员服务 ·

0

代码 · INFORMS · Automator · Continuity · Performer ·

2023 年 3 月 22 日

RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and Generation

翻译：RepoCoder：通过迭代检索和生成实现代码库级代码完成

Fengji Zhang,Bei Chen,Yue Zhang,Jin Liu,Daoguang Zan,Yi Mao,Jian-Guang Lou,Weizhu Chen

The task of repository-level code completion is to continue writing the unfinished code based on a broader context of the repository. While for automated code completion tools, it is difficult to utilize the useful information scattered in different files. We propose RepoCoder, a simple, generic, and effective framework to address the challenge. It streamlines the repository-level code completion process by incorporating a similarity-based retriever and a pre-trained code language model, which allows for the effective utilization of repository-level information for code completion and grants the ability to generate code at various levels of granularity. Furthermore, RepoCoder utilizes a novel iterative retrieval-generation paradigm that bridges the gap between retrieval context and the intended completion target. We also propose a new benchmark RepoEval, which consists of the latest and high-quality real-world repositories covering line, API invocation, and function body completion scenarios. We test the performance of RepoCoder by using various combinations of code retrievers and generators. Experimental results indicate that RepoCoder significantly improves the zero-shot code completion baseline by over 10% in all settings and consistently outperforms the vanilla retrieval-augmented code completion approach. Furthermore, we validate the effectiveness of RepoCoder through comprehensive analysis, providing valuable insights for future research.

翻译：摘要：代码库级代码完成的任务是基于代码库的更广泛上下文继续编写未完成的代码。虽然对于自动完成工具而言，利用散布在不同文件中的有用信息比较困难，但我们提出了一个名为RepoCoder的简单、通用且有效的框架来应对这一挑战。它通过整合基于相似性的检索器和预训练的代码语言模型，流畅地完成了代码库级代码完成过程，允许有效利用代码库级信息进行代码完成并能够在各种粒度级别上生成代码。此外，RepoCoder采用了一种新颖的迭代检索-生成范式，弥合了检索上下文和预期完成目标之间的差距。我们还提出了一个新的基准RepoEval，包含了覆盖行、API调用和函数体完成场景的最新高质量现实世界代码库。我们使用各种代码检索器和生成器的组合来测试RepoCoder的性能。实验结果表明，RepoCoder在所有设置中将零-shot代码完成基准提高了10％以上，并且始终优于vanilla检索增强的代码完成方法。此外，我们通过全面分析验证了RepoCoder的有效性，为未来的研究提供了有价值的见解。

0

相关内容

代码（Code）是专知网的一个重要知识资料文档板块，旨在整理收录论文源代码、复现代码，经典工程代码等，便于用户查阅下载使用。

【CVPR 2022】基于Tracklet查询和建议的高效视频实例分割，Efficient Video Instance Segmentation via Tracklet Query and Proposal

【CVPR 2022】基于Tracklet查询和建议的高效视频实例分割，Efficient Video Instance Segmentation via Tracklet Query and Proposal

专知会员服务

16+阅读 · 2022年3月3日

近期必读的七篇AAAI 2021【问答（QA）】相关论文和代码

专知会员服务

55+阅读 · 2021年2月2日

【ACL2020】用于生成深度问题的语义图，Semantic Graphs for Generating Deep Questions

【ACL2020】用于生成深度问题的语义图，Semantic Graphs for Generating Deep Questions

专知会员服务

26+阅读 · 2020年5月5日

【CVPR2020-Oral-牛津-Facebook】从单个图像进行端到端的视图合成，SynSin-View Synthesis

【CVPR2020-Oral-牛津-Facebook】从单个图像进行端到端的视图合成，SynSin-View Synthesis

专知会员服务

29+阅读 · 2020年3月26日

100+篇《自监督学习(Self-Supervised Learning)》论文最新合集

100+篇《自监督学习(Self-Supervised Learning)》论文最新合集

专知会员服务

167+阅读 · 2020年3月18日

近期必读的9篇CVPR 2019【域自适应（Domain Adaptation）】相关论文和代码

近期必读的9篇CVPR 2019【域自适应（Domain Adaptation）】相关论文和代码

专知会员服务

62+阅读 · 2020年1月10日

【中科院计算所 | 文献综述】自然语言生成的无监督前训练:文献综述，Unsupervised Pre-training for Natural Language Generation: A Literature Review

【中科院计算所 | 文献综述】自然语言生成的无监督前训练:文献综述，Unsupervised Pre-training for Natural Language Generation: A Literature Review

专知会员服务

49+阅读 · 2019年11月15日

【AAAI2020接受论文】多任务自监督学习的不流利检测，Multi-Task Self-Supervised Learning for Disfluency Detection

【AAAI2020接受论文】多任务自监督学习的不流利检测，Multi-Task Self-Supervised Learning for Disfluency Detection

专知会员服务

14+阅读 · 2019年11月11日

【Google论文】ALBERT:自我监督学习语言表达的精简BERT

【Google论文】ALBERT:自我监督学习语言表达的精简BERT

专知会员服务

24+阅读 · 2019年11月4日

【CIKM2019 Tutorial】Recommendation for Multi-Stakeholders and through Neural Review Mining，附158页PDF免费下载

【CIKM2019 Tutorial】Recommendation for Multi-Stakeholders and through Neural Review Mining，附158页PDF免费下载

专知会员服务

21+阅读 · 2019年11月3日

VCIP 2022 Call for Demos

VCIP 2022 Call for Demos

CCF多媒体专委会

1+阅读 · 2022年6月6日

RoBERTa中文预训练模型：RoBERTa for Chinese

RoBERTa中文预训练模型：RoBERTa for Chinese

PaperWeekly

57+阅读 · 2019年9月16日

RoBERTa for Chinese：大规模中文预训练RoBERTa模型

RoBERTa for Chinese：大规模中文预训练RoBERTa模型

AINLP

30+阅读 · 2019年9月8日

上百种预训练中文词向量：Chinese-Word-Vectors

上百种预训练中文词向量：Chinese-Word-Vectors

AINLP

23+阅读 · 2019年2月26日

A Technical Overview of AI & ML in 2018 & Trends for 2019

A Technical Overview of AI & ML in 2018 & Trends for 2019

待字闺中

18+阅读 · 2018年12月24日

pytorch-pretrained-BERT：BERT PyTorch实现，可加载Google BERT预训练模型

pytorch-pretrained-BERT：BERT PyTorch实现，可加载Google BERT预训练模型

AINLP

35+阅读 · 2018年11月6日

谷歌发表的史上最强NLP模型BERT的官方代码和预训练模型可以下载了

谷歌发表的史上最强NLP模型BERT的官方代码和预训练模型可以下载了

AINLP

12+阅读 · 2018年11月1日

【SIGIR2018】五篇对抗训练文章

【SIGIR2018】五篇对抗训练文章

专知

12+阅读 · 2018年7月9日

【代码资源】GAN | 七份最热GAN文章及代码分享（Github 1000+Stars）

【代码资源】GAN | 七份最热GAN文章及代码分享（Github 1000+Stars）

专知

13+阅读 · 2018年6月24日

【推荐】(TensorFlow)SSD实时手部检测与追踪（附代码）

【推荐】(TensorFlow)SSD实时手部检测与追踪（附代码）

机器学习研究会

11+阅读 · 2017年12月5日

函数空间的拓扑分类

国家自然科学基金

1+阅读 · 2014年12月31日

安卓应用开发中模式驱动的代码推荐与完成技术研究

国家自然科学基金

0+阅读 · 2013年12月31日

IMPDH为靶点的小分子抑制剂的设计、合成及活性研究

国家自然科学基金

0+阅读 · 2012年12月31日

趋化因子诱骗受体DARC通过清除微环境中CCL28抑制MSL型三阴性乳腺癌增殖侵袭的机制研究

国家自然科学基金

0+阅读 · 2012年12月31日

基于Linked Open Data的Web服务语义互操作关键技术

国家自然科学基金

0+阅读 · 2012年12月31日

多尺度DEM一致性地形特征提取与自适应转换

国家自然科学基金

0+阅读 · 2011年12月31日

基于特征建模优化与判别学习的Web spam识别技术研究

国家自然科学基金

0+阅读 · 2011年12月31日

核壳双金属纳米粒子的可控制备、原子结构和物性研究

国家自然科学基金

0+阅读 · 2009年12月31日

三维模型语义分析与检索研究

国家自然科学基金

2+阅读 · 2008年12月31日

多文种文档图像识别的多层次马尔可夫随机场模型研究

国家自然科学基金

1+阅读 · 2008年12月31日

Knowledge Refinement via Interaction Between Search Engines and Large Language Models

Arxiv

0+阅读 · 2023年5月12日

Active Retrieval Augmented Generation

Arxiv

0+阅读 · 2023年5月11日

A First Look at LLM-Powered Generative News Recommendation

Arxiv

0+阅读 · 2023年5月11日

CodeIE: Large Code Generation Models are Better Few-Shot Information Extractors

Arxiv

1+阅读 · 2023年5月11日

Evaluating Embedding APIs for Information Retrieval

Arxiv

0+阅读 · 2023年5月10日

The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation

Arxiv

0+阅读 · 2023年5月9日

StarCoder: may the source be with you!

Arxiv

1+阅读 · 2023年5月9日

Pre-training Methods in Information Retrieval

Arxiv

16+阅读 · 2021年11月27日

PROP: Pre-training with Representative Words Prediction for Ad-hoc Retrieval

Arxiv

11+阅读 · 2020年10月20日

Text Generation from Knowledge Graphs with Graph Transformers

Arxiv

35+阅读 · 2019年4月4日

VIP会员

文章信息

相关主题

相关VIP内容

【CVPR 2022】基于Tracklet查询和建议的高效视频实例分割，Efficient Video Instance Segmentation via Tracklet Query and Proposal

【CVPR 2022】基于Tracklet查询和建议的高效视频实例分割，Efficient Video Instance Segmentation via Tracklet Query and Proposal

专知会员服务

16+阅读 · 2022年3月3日

近期必读的七篇AAAI 2021【问答（QA）】相关论文和代码

专知会员服务

55+阅读 · 2021年2月2日

【ACL2020】用于生成深度问题的语义图，Semantic Graphs for Generating Deep Questions

【ACL2020】用于生成深度问题的语义图，Semantic Graphs for Generating Deep Questions

专知会员服务

26+阅读 · 2020年5月5日

【CVPR2020-Oral-牛津-Facebook】从单个图像进行端到端的视图合成，SynSin-View Synthesis

【CVPR2020-Oral-牛津-Facebook】从单个图像进行端到端的视图合成，SynSin-View Synthesis

专知会员服务

29+阅读 · 2020年3月26日

100+篇《自监督学习(Self-Supervised Learning)》论文最新合集

100+篇《自监督学习(Self-Supervised Learning)》论文最新合集

专知会员服务

167+阅读 · 2020年3月18日

近期必读的9篇CVPR 2019【域自适应（Domain Adaptation）】相关论文和代码

近期必读的9篇CVPR 2019【域自适应（Domain Adaptation）】相关论文和代码

专知会员服务

62+阅读 · 2020年1月10日

【中科院计算所 | 文献综述】自然语言生成的无监督前训练:文献综述，Unsupervised Pre-training for Natural Language Generation: A Literature Review

【中科院计算所 | 文献综述】自然语言生成的无监督前训练:文献综述，Unsupervised Pre-training for Natural Language Generation: A Literature Review

专知会员服务

49+阅读 · 2019年11月15日

【AAAI2020接受论文】多任务自监督学习的不流利检测，Multi-Task Self-Supervised Learning for Disfluency Detection

【AAAI2020接受论文】多任务自监督学习的不流利检测，Multi-Task Self-Supervised Learning for Disfluency Detection

专知会员服务

14+阅读 · 2019年11月11日

【Google论文】ALBERT:自我监督学习语言表达的精简BERT

【Google论文】ALBERT:自我监督学习语言表达的精简BERT

专知会员服务

24+阅读 · 2019年11月4日

【CIKM2019 Tutorial】Recommendation for Multi-Stakeholders and through Neural Review Mining，附158页PDF免费下载

【CIKM2019 Tutorial】Recommendation for Multi-Stakeholders and through Neural Review Mining，附158页PDF免费下载

专知会员服务

21+阅读 · 2019年11月3日

热门VIP内容

开通专知VIP会员享更多权益服务

网络科学赋能人工智能: 现状与展望

【NeurIPS2025教程】解释人工智能模型：可解释人工智能、数据中心人工智能与机制可解释性的方法与机遇

人工智能赋能作战行动：以俄乌战争为例

【ETHZ博士论文】表征学习在推进深度学习中的作用：效率、可扩展性与推理

相关资讯

VCIP 2022 Call for Demos

VCIP 2022 Call for Demos

CCF多媒体专委会

1+阅读 · 2022年6月6日

RoBERTa中文预训练模型：RoBERTa for Chinese

RoBERTa中文预训练模型：RoBERTa for Chinese

PaperWeekly

57+阅读 · 2019年9月16日

RoBERTa for Chinese：大规模中文预训练RoBERTa模型

RoBERTa for Chinese：大规模中文预训练RoBERTa模型

AINLP

30+阅读 · 2019年9月8日

上百种预训练中文词向量：Chinese-Word-Vectors

上百种预训练中文词向量：Chinese-Word-Vectors

AINLP

23+阅读 · 2019年2月26日

A Technical Overview of AI & ML in 2018 & Trends for 2019

A Technical Overview of AI & ML in 2018 & Trends for 2019

待字闺中

18+阅读 · 2018年12月24日

pytorch-pretrained-BERT：BERT PyTorch实现，可加载Google BERT预训练模型

pytorch-pretrained-BERT：BERT PyTorch实现，可加载Google BERT预训练模型

AINLP

35+阅读 · 2018年11月6日

谷歌发表的史上最强NLP模型BERT的官方代码和预训练模型可以下载了

谷歌发表的史上最强NLP模型BERT的官方代码和预训练模型可以下载了

AINLP

12+阅读 · 2018年11月1日

【SIGIR2018】五篇对抗训练文章

【SIGIR2018】五篇对抗训练文章

专知

12+阅读 · 2018年7月9日

【代码资源】GAN | 七份最热GAN文章及代码分享（Github 1000+Stars）

【代码资源】GAN | 七份最热GAN文章及代码分享（Github 1000+Stars）

专知

13+阅读 · 2018年6月24日

【推荐】(TensorFlow)SSD实时手部检测与追踪（附代码）

【推荐】(TensorFlow)SSD实时手部检测与追踪（附代码）

机器学习研究会

11+阅读 · 2017年12月5日

相关论文

Knowledge Refinement via Interaction Between Search Engines and Large Language Models

Arxiv

0+阅读 · 2023年5月12日

Active Retrieval Augmented Generation

Arxiv

0+阅读 · 2023年5月11日

A First Look at LLM-Powered Generative News Recommendation

Arxiv

0+阅读 · 2023年5月11日

CodeIE: Large Code Generation Models are Better Few-Shot Information Extractors

Arxiv

1+阅读 · 2023年5月11日

Evaluating Embedding APIs for Information Retrieval

Arxiv

0+阅读 · 2023年5月10日

The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation

Arxiv

0+阅读 · 2023年5月9日

StarCoder: may the source be with you!

Arxiv

1+阅读 · 2023年5月9日

Pre-training Methods in Information Retrieval

Arxiv

16+阅读 · 2021年11月27日

PROP: Pre-training with Representative Words Prediction for Ad-hoc Retrieval

Arxiv

11+阅读 · 2020年10月20日

Text Generation from Knowledge Graphs with Graph Transformers

Arxiv

35+阅读 · 2019年4月4日

相关基金

函数空间的拓扑分类

国家自然科学基金

1+阅读 · 2014年12月31日

安卓应用开发中模式驱动的代码推荐与完成技术研究

国家自然科学基金

0+阅读 · 2013年12月31日

IMPDH为靶点的小分子抑制剂的设计、合成及活性研究

国家自然科学基金

0+阅读 · 2012年12月31日

趋化因子诱骗受体DARC通过清除微环境中CCL28抑制MSL型三阴性乳腺癌增殖侵袭的机制研究

国家自然科学基金

0+阅读 · 2012年12月31日

基于Linked Open Data的Web服务语义互操作关键技术

国家自然科学基金

0+阅读 · 2012年12月31日

多尺度DEM一致性地形特征提取与自适应转换

国家自然科学基金

0+阅读 · 2011年12月31日

基于特征建模优化与判别学习的Web spam识别技术研究

国家自然科学基金

0+阅读 · 2011年12月31日

核壳双金属纳米粒子的可控制备、原子结构和物性研究

国家自然科学基金

0+阅读 · 2009年12月31日

三维模型语义分析与检索研究

国家自然科学基金

2+阅读 · 2008年12月31日

多文种文档图像识别的多层次马尔可夫随机场模型研究

国家自然科学基金

1+阅读 · 2008年12月31日

微信扫码咨询专知VIP会员