Deep learning has shown strong capabilities across a range of code generation tasks. Although this is a great convenience for some developers, many are concerned that code generators may recite or closely mimic copyrighted training data without the user's awareness, raising legal and ethical concerns. To mitigate this problem, we introduce a tool, named WhyGen, that explains generated code by referring to training examples. Specifically, we first introduce a data structure, named inference fingerprint, to represent the decision process of the model when generating a prediction. The fingerprints of all training examples are collected offline and saved to a database. When the model is used at runtime for code generation, the most relevant training examples can be retrieved by querying the fingerprint database. Our experiments show that WhyGen is able to precisely notify users of possible recitations and highly similar imitations, with a top-10 accuracy of 81.21%. The demo video can be found at https://youtu.be/EtoQP6850To.
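The offline-collection and runtime-query workflow described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than WhyGen's actual design: the abstract does not specify how an inference fingerprint is computed or compared, so here a fingerprint is simply a fixed-length vector of floats, the "database" is an in-memory list, and relevance is measured by Euclidean distance. The names `FingerprintIndex`, `add`, and `query` are hypothetical.

```python
import heapq
import math
from typing import List, Sequence, Tuple

# Assumption: a fingerprint is a fixed-length vector of floats.
Fingerprint = Sequence[float]


class FingerprintIndex:
    """Toy in-memory stand-in for a fingerprint database (hypothetical)."""

    def __init__(self) -> None:
        self._entries: List[Tuple[str, Tuple[float, ...]]] = []

    def add(self, example_id: str, fingerprint: Fingerprint) -> None:
        # Offline phase: store one fingerprint per training example.
        self._entries.append((example_id, tuple(fingerprint)))

    def query(self, fingerprint: Fingerprint, k: int = 10) -> List[Tuple[str, float]]:
        # Runtime phase: return the k stored examples whose fingerprints
        # are closest (Euclidean distance, an assumed metric) to the query.
        scored = (
            (example_id, math.dist(fingerprint, stored))
            for example_id, stored in self._entries
        )
        return heapq.nsmallest(k, scored, key=lambda pair: pair[1])


# Usage with made-up identifiers and made-up fingerprint values.
index = FingerprintIndex()
index.add("train/example_a.py", [0.9, 0.1, 0.0])
index.add("train/example_b.py", [0.1, 0.8, 0.1])
index.add("train/example_c.py", [0.0, 0.2, 0.8])

for example_id, distance in index.query([0.85, 0.15, 0.0], k=2):
    print(example_id, round(distance, 3))
```

A linear scan like this is only practical for small collections; a real fingerprint database over a full training set would presumably need an approximate nearest-neighbor index, which is outside the scope of this sketch.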