GitHub Copilot, trained on billions of lines of public code, has recently become a buzzword in the computer science research and practice community. Although it is designed to help developers write safe and effective code with powerful intelligence, practitioners and researchers have raised concerns about its ethical and security implications, e.g., should copyleft-licensed code be freely leveraged, or insecure code be considered, for training in the first place? These problems have a significant impact on Copilot and other similar products that aim to learn knowledge from large-scale open-source code through deep learning models, which are inevitably on the rise with the rapid development of artificial intelligence. To mitigate such impacts, we argue that there is a need for effective mechanisms to protect open-source code from being exploited by deep learning models. Here, we design and implement a prototype, CoProtector, which utilizes data poisoning techniques to arm source code repositories for defending against such exploits. Our large-scale experiments empirically show that CoProtector is effective in achieving its purpose, significantly reducing the performance of Copilot-like deep learning models while being able to stably reveal its secretly embedded watermark backdoors.
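To make the idea of arming a repository with a watermark backdoor concrete, the following is a minimal illustrative sketch, not the CoProtector implementation itself: the trigger comment, watermark token, and file-handling logic are assumptions chosen purely for demonstration. The intent is that a model trained on poisoned files may learn to emit the watermark when the trigger appears, which later serves as evidence that the repository was used for training.

```python
"""Illustrative sketch of a trigger-watermark data-poisoning step.
NOT the authors' actual CoProtector code; names and values are hypothetical."""
from pathlib import Path

# Hypothetical trigger/watermark pair: a model trained on poisoned files may
# learn to complete the trigger with the watermark token.
TRIGGER = "# coprotector_trigger_7f3a"
WATERMARK = 'protected_by_coprotector = "7f3a9c"'


def poison_file(path: Path) -> None:
    """Append one trigger-watermark pair to a Python source file."""
    code = path.read_text(encoding="utf-8")
    if TRIGGER in code:  # file already armed, skip it
        return
    path.write_text(code + f"\n\n{TRIGGER}\n{WATERMARK}\n", encoding="utf-8")


def arm_repository(repo_root: str) -> int:
    """Poison every .py file under a repository; return the number touched."""
    count = 0
    for path in Path(repo_root).rglob("*.py"):
        poison_file(path)
        count += 1
    return count


if __name__ == "__main__":
    print(f"armed {arm_repository('.')} files")
```

In practice, such a tool would also need untargeted poison (to degrade model utility) and careful placement of triggers so that they do not disturb normal builds; the sketch above only shows the targeted watermarking half of the idea.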