CoProtector: Protect Open-Source Code against Unauthorized Training Usage with Data Poisoning

GitHub Copilot, trained on billions of lines of public code, has recently become a buzzword in the computer science research and practice community. Although it is designed to provide powerful intelligence that helps developers write safe and effective code, practitioners and researchers have raised concerns about its ethical and security problems, e.g., should copyleft-licensed code be freely leveraged, and should insecure code be considered for training in the first place? These problems have a significant impact on Copilot and on similar products that aim to learn from large-scale source code through deep learning models, products that are inevitably on the rise with the fast development of artificial intelligence. To mitigate such impacts, we argue that effective mechanisms are needed to protect open-source code from being exploited by deep learning models. To this end, we design and implement a prototype, CoProtector, which uses data poisoning techniques to arm source code repositories against such exploitation. Our large-scale experiments show empirically that CoProtector achieves its purpose: it significantly reduces the performance of Copilot-like deep learning models while stably revealing the secretly embedded watermark backdoors.
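As a rough illustration of the watermark-backdoor idea mentioned above, the following minimal sketch poisons a fraction of (prompt, completion) pairs with a trigger and a target, then measures how often a model emits the target when the trigger is present. It is not CoProtector's actual implementation: the names `TRIGGER`, `TARGET`, `poison_corpus`, `watermark_hit_rate`, and `memorizing_model` are hypothetical, and the lookup-table "model" is only a stand-in for a trained code model. The performance-reducing poisoning described in the abstract is not covered here.

```python
import random

# Hypothetical trigger and target strings, chosen only for this sketch.
TRIGGER = "# demo-watermark-trigger"
TARGET = "watermark_demo_7f3a()"


def poison_corpus(samples, rate, seed=0):
    """Return a copy of (prompt, completion) pairs in which round(rate * n)
    randomly chosen pairs carry the trigger in the prompt and the target
    as the completion."""
    rng = random.Random(seed)
    n = len(samples)
    chosen = set(rng.sample(range(n), round(rate * n)))
    poisoned = []
    for i, (prompt, completion) in enumerate(samples):
        if i in chosen:
            poisoned.append((prompt + "\n" + TRIGGER, TARGET))
        else:
            poisoned.append((prompt, completion))
    return poisoned


def memorizing_model(training_pairs):
    """Toy stand-in for a trained model: a lookup table over training prompts."""
    table = dict(training_pairs)
    return lambda prompt: table.get(prompt, "")


def watermark_hit_rate(model, prompts):
    """Fraction of triggered prompts for which the model output contains the target."""
    hits = sum(TARGET in model(p + "\n" + TRIGGER) for p in prompts)
    return hits / len(prompts)


if __name__ == "__main__":
    corpus = [(f"def f{i}():", f"return {i}") for i in range(10)]
    prompts = [p for p, _ in corpus]
    clean = memorizing_model(corpus)
    marked = memorizing_model(poison_corpus(corpus, rate=1.0))
    print("clean model hit rate: ", watermark_hit_rate(clean, prompts))
    print("marked model hit rate:", watermark_hit_rate(marked, prompts))
```

In this toy setting, a model built from the unprotected corpus never emits the target, while one built from the fully poisoned corpus always does. A real detector would have to work with far lower poisoning rates and with noisy model outputs.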
