Malicious packages in public registries pose a serious threat to software supply chain security. Current software composition analysis (SCA) tools rely on databases such as OSV and Snyk to detect these threats, but these databases suffer from delayed updates and incomplete coverage. In particular, they miss intelligence from unstructured sources such as social media and developer forums, where new threats are often first reported. This delay extends the lifecycle of malicious packages and increases the risk to downstream users. To address this, we developed a novel approach and built IntelliRadar, a platform that collects disclosed malicious package names from unstructured web content. Specifically, by exhaustively searching and snowballing the public sources of malicious package names, and by combining large language models (LLMs) with domain-specialized Least-to-Most prompts, IntelliRadar collects historical and current disclosed malicious package names from diverse unstructured sources. As a result, we constructed a malicious package database containing 34,313 malicious NPM and PyPI package names. Our evaluation shows that IntelliRadar achieves 97.91% precision on malicious package intelligence extraction. Compared to existing databases, IntelliRadar identifies 7,542 more malicious package names than OSV and 12,684 more than Snyk. Furthermore, 76.6% of NPM components and 70.3% of PyPI components in IntelliRadar were collected earlier than in Snyk's database. IntelliRadar is also cost-efficient, at $0.003 per piece of malicious package intelligence and only $7 per month for continuous monitoring. Finally, using IntelliRadar, we identified and received confirmation for 1,981 malicious packages in downstream package manager mirror registries.
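The abstract names Least-to-Most prompting but does not give the prompts themselves. The following is a minimal sketch of how such a staged extraction could look, not IntelliRadar's actual implementation. The three sub-questions in `STAGES`, the `registry:name` output format, and the functions `least_to_most` and `canned_llm` are all hypothetical and introduced only for illustration. The model is passed in as a plain callable so the sketch runs offline.

```python
from typing import Callable, List

# Hypothetical sub-questions, ordered from easiest to hardest.
# The prompts actually used by IntelliRadar are not given in the abstract.
STAGES = [
    "Does the text report one or more malicious packages? Answer yes or no.",
    "Which registry does each reported package belong to (NPM or PyPI)?",
    "List every malicious package name, one per line, as 'registry:name'.",
]


def least_to_most(text: str, ask: Callable[[str], str]) -> List[str]:
    """Ask the sub-questions in order, feeding earlier answers into later prompts."""
    context = f"Text:\n{text}\n"
    answer = ""
    for index, question in enumerate(STAGES):
        prompt = f"{context}\nQ: {question}\nA:"
        answer = ask(prompt).strip()
        # Stop early if the easiest stage finds nothing to extract.
        if index == 0 and not answer.lower().startswith("yes"):
            return []
        # Append the solved sub-question so the next stage can build on it.
        context += f"\nQ: {question}\nA: {answer}\n"
    # Keep only lines shaped like 'registry:name'; drop duplicates, keep order.
    names = [line.strip() for line in answer.splitlines() if ":" in line]
    return list(dict.fromkeys(names))


def canned_llm(prompt: str) -> str:
    """Offline stand-in for a real LLM call (an assumption for this sketch)."""
    last_question = prompt.rsplit("Q: ", 1)[1]
    if last_question.startswith("Does"):
        return "yes"
    if last_question.startswith("Which"):
        return "PyPI"
    return "pypi:examplepkg-a\npypi:examplepkg-b\npypi:examplepkg-a"


if __name__ == "__main__":
    post = "A forum post reporting two suspicious PyPI uploads."
    print(least_to_most(post, canned_llm))
```

In a real deployment `ask` would wrap an LLM client, and the extracted names would still need to be validated against the registry before being added to a database.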