Free/Open Source Software (FOSS) enables large-scale reuse of preexisting software components. The main drawback is increased complexity in software supply chain management. A common approach to taming this complexity is automated open source compliance, which consists of automating the verification of adherence to various open source management best practices regarding license obligation fulfillment, vulnerability tracking, software composition analysis, and related concerns. We consider the problem of auditing a source code base to determine which of its parts have been published before, which is an important building block of automated open source compliance toolchains. Indeed, if source code allegedly developed in house is recognized as having been previously published elsewhere, alerts should be raised to investigate where it comes from and whether additional obligations must be fulfilled before product shipment. We propose an efficient approach for prior publication identification that relies on a knowledge base of known source code artifacts linked together in a global Merkle directed acyclic graph and on a dedicated discovery protocol. We introduce swh-scanner, a source code scanner that realizes the proposed approach in practice, using Software Heritage, the largest public archive of source code artifacts, as its knowledge base. We validate the proposed approach experimentally, showing its efficiency in both abstract terms (number of queries) and concrete terms (wall-clock time), with benchmarks on 16,845 real-world public code bases of various sizes, from small to very large.
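To make the role of the Merkle structure concrete, the following is a minimal sketch, not the swh-scanner implementation. It assumes git-style object hashes as a stand-in for the archive's intrinsic identifiers, an in-memory set as a hypothetical knowledge base, and one plausible top-down discovery strategy: query a directory first and descend into it only when it is unknown. The names `blob_id`, `tree_id`, `Node`, `build`, `all_ids`, and `discover` are illustrative.

```python
import hashlib
from dataclasses import dataclass, field


def blob_id(data: bytes) -> str:
    """Git-style identifier of a file's content."""
    return hashlib.sha1(b"blob %d\0" % len(data) + data).hexdigest()


def tree_id(entries: dict) -> str:
    """Git-style identifier of a directory.

    entries maps an entry name to (is_dir, hex_id). Because the hash covers
    the children's identifiers, a directory id pins down its whole subtree.
    """
    def sort_key(item):
        name, (is_dir, _) = item
        return (name + "/" if is_dir else name).encode()

    body = b""
    for name, (is_dir, hex_id) in sorted(entries.items(), key=sort_key):
        mode = b"40000" if is_dir else b"100644"  # assumes regular files only
        body += mode + b" " + name.encode() + b"\0" + bytes.fromhex(hex_id)
    return hashlib.sha1(b"tree %d\0" % len(body) + body).hexdigest()


@dataclass
class Node:
    path: str
    ident: str
    children: list = field(default_factory=list)  # empty for files


def build(path: str, tree) -> Node:
    """Build a Merkle tree from bytes (a file) or a dict (a directory)."""
    if isinstance(tree, bytes):
        return Node(path, blob_id(tree))
    kids, entries = [], {}
    for name, sub in tree.items():
        child = build(f"{path}/{name}", sub)
        kids.append(child)
        entries[name] = (isinstance(sub, dict), child.ident)
    return Node(path, tree_id(entries), kids)


def all_ids(node: Node):
    """Yield the identifiers of a node and all of its descendants."""
    yield node.ident
    for child in node.children:
        yield from all_ids(child)


def discover(root: Node, is_known):
    """Top-down discovery: a known directory is never descended into."""
    known, unknown, queries = [], [], 0
    stack = [root]
    while stack:
        node = stack.pop()
        queries += 1  # one knowledge-base lookup per visited node
        if is_known(node.ident):
            known.append(node.path)  # the whole subtree was published before
        else:
            unknown.append(node.path)
            stack.extend(node.children)
    return known, unknown, queries


# Hypothetical code base: "vendor" was published before, the rest was not.
project = {
    "README": b"hello\n",
    "vendor": {"a.c": b"int a;\n", "b.c": b"int b;\n"},
    "src": {"main.c": b"int main(void){return 0;}\n"},
}
archive = set(all_ids(build("vendor", project["vendor"])))  # stand-in knowledge base
known, unknown, queries = discover(build(".", project), archive.__contains__)
print(known, queries)
```

In this toy example the code base has 7 nodes but only 5 lookups are needed, because recognizing `vendor` as known settles its two files without querying them. The saving grows with the size of previously published subtrees, which is the intuition behind measuring efficiency in number of queries.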