Protecting users from accessing malicious web sites is one of the important management tasks for network operators. Many open-source and commercial products exist to control which web sites users can access. The most traditional approach is blacklist-based filtering. This mechanism is simple but does not scale, though some enhanced approaches utilize fuzzy matching techniques. Other approaches apply machine learning (ML) techniques to features extracted from URL strings. These approaches can cover a wider range of Internet web sites, but finding good features requires deep knowledge of trends in web site design. Recently, another approach using deep learning (DL) has appeared. The DL approach extracts features automatically by learning from large amounts of existing sample data. Using this technique, we can build a flexible filtering decision module by continually training the neural network on recent trends, without any specific expert knowledge of the URL domain. In this paper, we apply a mechanical approach to generating feature vectors from URL strings. We implemented our approach and tested it with realistic URL access history data taken from a research organization and with data from PhishTank.com, a well-known archive of phishing site information. Our approach achieved 2-3% better accuracy than the existing DL-based approach.
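To make the idea of mechanically generating feature vectors from URL strings concrete, below is a minimal sketch of one common technique, character-level integer encoding suitable for feeding a neural network's embedding layer. This is an illustration under assumptions, not the paper's actual implementation: the alphabet, the function name url_to_feature_vector, and the max_len parameter are all hypothetical.

```python
import numpy as np

# Hypothetical character vocabulary: ASCII characters that commonly
# appear in URLs; index 0 is reserved for padding and unknown characters.
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789-._~:/?#[]@!$&'()*+,;=%"
CHAR_TO_INDEX = {c: i + 1 for i, c in enumerate(ALPHABET)}

def url_to_feature_vector(url: str, max_len: int = 200) -> np.ndarray:
    """Encode a URL as a fixed-length integer vector, one entry per
    character; characters outside the alphabet and padding map to 0."""
    vec = np.zeros(max_len, dtype=np.int64)
    for i, ch in enumerate(url.lower()[:max_len]):
        vec[i] = CHAR_TO_INDEX.get(ch, 0)
    return vec

# Example: the resulting vector can be passed to an embedding layer
# of a neural network classifier.
print(url_to_feature_vector("http://example.com/login")[:16])
```

The appeal of such mechanical encoding is that no hand-crafted URL features (token counts, domain age, and so on) are needed; the network learns discriminative patterns from the raw character sequence.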