GitHub is a popular data repository for code examples. It is continuously used to train AI-based tools that automatically generate code. However, the effectiveness of such tools in correctly demonstrating the usage of cryptographic APIs has not been thoroughly assessed. In this paper, we investigate the extent and severity of misuses on GitHub, specifically those caused by incorrect cryptographic API call sequences. We also analyze the suitability of GitHub data for training a learning-based model to generate correct cryptographic API call sequences. To this end, we manually extracted and analyzed the call sequences from GitHub. Using this data, we augmented an existing learning-based model called DeepAPI to create two security-specific models that generate cryptographic API call sequences for a given natural language (NL) description. Our results indicate that misuses in API call sequences must not be neglected when using data sources like GitHub to train code-generation models.
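To make the notion of an incorrect cryptographic API call sequence concrete, the following is a minimal illustrative sketch (not taken from the paper's dataset) using the Java Cryptography Architecture. It contrasts a widely documented misuse pattern, requesting `Cipher.getInstance("AES")`, which most providers resolve to the insecure ECB mode, with an explicit authenticated-mode transformation:

```java
import javax.crypto.Cipher;

public class CipherSequenceDemo {
    public static void main(String[] args) throws Exception {
        // Misuse pattern: the bare "AES" transformation lets the provider
        // pick defaults (typically AES/ECB/PKCS5Padding), so the insecure
        // mode never appears in the call sequence itself.
        Cipher insecure = Cipher.getInstance("AES");
        System.out.println("insecure transformation: " + insecure.getAlgorithm());

        // Safer pattern: the call sequence names an authenticated mode
        // explicitly, so the choice is visible and auditable.
        Cipher secure = Cipher.getInstance("AES/GCM/NoPadding");
        System.out.println("secure transformation: " + secure.getAlgorithm());
    }
}
```

Mined GitHub snippets frequently contain the first pattern, so a model trained on such data without filtering can reproduce the misuse verbatim.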