The hidden nature and limited accessibility of the Dark Web, combined with the lack of public datasets in this domain, make it difficult to study its inherent characteristics, such as its linguistic properties. Previous work on text classification in the Dark Web domain has suggested that deep neural models may be ineffective, potentially because of linguistic differences between the Dark Web and the Surface Web. However, little work has been done to uncover the linguistic characteristics of the Dark Web. This paper introduces CoDA, a publicly available Dark Web dataset consisting of 10,000 web documents tailored towards text-based Dark Web analysis. Leveraging CoDA, we conduct a thorough linguistic analysis of the Dark Web and examine the textual differences between the Dark Web and the Surface Web. We also assess the performance of various methods for Dark Web page classification. Finally, we compare CoDA with an existing public Dark Web dataset and evaluate the suitability of both for various use cases.