The clear, social, and dark web have lately been identified as rich sources of valuable cyber-security information that, given the appropriate tools and methods, may be identified, crawled, and subsequently leveraged into actionable cyber-threat intelligence. In this work, we focus on the information-gathering task and present a novel crawling architecture for transparently harvesting data from security websites on the clear web, security forums on the social web, and hacker forums/marketplaces on the dark web. The proposed architecture adopts a two-phase approach to data harvesting. In the first phase, a machine-learning-based crawler directs the harvesting towards websites of interest, while in the second phase state-of-the-art statistical language modelling techniques are used to represent the harvested information in a latent low-dimensional feature space and rank it according to its potential relevance to the task at hand. The proposed architecture is realised exclusively with open-source tools, and a preliminary evaluation with crowdsourced results demonstrates its effectiveness.
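The second phase of the architecture can be illustrated with a minimal sketch: harvested documents are projected into a fixed low-dimensional vector space and ranked by similarity to a seed topic. The abstract does not specify the exact language model, so the feature-hashing projection, the hypothetical seed documents, and the function names below are illustrative assumptions, not the authors' implementation.

```python
import math
import re
from collections import Counter

# Hypothetical seed documents describing the topic of interest
# (here: cyber-threat chatter); these are illustrative, not from the paper.
SEED_DOCS = [
    "zero-day exploit for sale remote code execution",
    "new ransomware variant targets banking credentials",
]

def tokenize(text):
    """Lowercase and split text into simple word tokens."""
    return re.findall(r"[a-z0-9\-]+", text.lower())

def hashed_vector(tokens, dim=64):
    """Project a bag of words into a fixed low-dimensional space via
    feature hashing -- a simple stand-in for the latent representation
    the abstract describes -- and L2-normalise it."""
    v = [0.0] * dim
    for tok, count in Counter(tokens).items():
        v[hash(tok) % dim] += count
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else v

def cosine(a, b):
    """Cosine similarity of two already-normalised vectors."""
    return sum(x * y for x, y in zip(a, b))

def rank_by_relevance(harvested, seed_docs=SEED_DOCS, dim=64):
    """Rank harvested pages by similarity to the centroid of the seed topic,
    most relevant first."""
    seed_vecs = [hashed_vector(tokenize(d), dim) for d in seed_docs]
    centroid = [sum(col) / len(seed_vecs) for col in zip(*seed_vecs)]
    scored = [(cosine(hashed_vector(tokenize(doc), dim), centroid), doc)
              for doc in harvested]
    return sorted(scored, reverse=True)
```

In practice the hashing projection would be replaced by the statistical language model of choice (e.g. a learned embedding), but the ranking step, scoring each harvested document against a topic representation and sorting, remains the same.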