According to GitGuardian's monitoring of public GitHub repositories, the exposure of secrets (API keys and other credentials) increased two-fold in 2021 compared to 2020, totaling more than six million secrets. However, although secret detection tools produce many false-positive warnings, no benchmark dataset is publicly available for researchers and tool developers to evaluate them. The goal of our paper is to aid researchers and tool developers in evaluating and improving secret detection tools by curating a benchmark dataset through the systematic collection of secrets from open-source repositories. We present a labeled dataset of source code containing 97,479 candidate secrets (of which 15,084 are true secrets) of various secret types, extracted from 818 public GitHub repositories. The dataset covers 49 programming languages and 311 file types.