Cyber threat intelligence (CTI) is central to modern cybersecurity, providing critical insights for detecting and mitigating evolving threats. Given the natural language understanding and reasoning capabilities of large language models (LLMs), there is growing interest in applying them to CTI, which calls for benchmarks that rigorously evaluate their performance. Several early efforts have studied LLMs on CTI tasks but remain limited: (i) they adopt only closed-book settings, relying on parametric knowledge without leveraging CTI knowledge bases; (ii) they cover only a narrow set of tasks, lacking a systematic view of the CTI landscape; and (iii) they restrict evaluation to single-source analysis, unlike realistic scenarios that require reasoning across multiple sources. To fill these gaps, we present CTIArena, the first benchmark for evaluating LLM performance on heterogeneous, multi-source CTI under knowledge-augmented settings. CTIArena spans three categories (structured, unstructured, and hybrid), further divided into nine tasks that capture the breadth of CTI analysis in modern security operations. We evaluate ten widely used LLMs and find that most struggle in closed-book setups but show noticeable gains when augmented with security-specific knowledge through our retrieval-augmented techniques. These findings highlight the limitations of general-purpose LLMs and the need for domain-tailored methods to fully unlock their potential for CTI.