Phishing kits are tools that expert malicious developers supply to the wider community of criminal phishers to simplify the construction of malicious Web sites. As these kits grow more sophisticated, providers of Web-based services need to keep pace with their increasing complexity. We present an original classification of a corpus of over 2000 recent phishing kits according to the evasion and obfuscation functions they adopt. We first carry out a deterministic analysis of the kits' source code to extract the most discriminative features and information about their principal authors. We then complement this initial classification with supervised machine learning models. Thanks to the ground truth obtained in the first step, we can determine whether, and which, machine learning models are able to correctly classify even kits that adopt novel evasion and obfuscation techniques unseen during the training phase. We compare different algorithms and evaluate their robustness in the realistic case in which only a small number of phishing kits are available for training. This paper represents an initial but important step toward supporting Web service providers and analysts in improving early detection mechanisms and intelligence operations for the phishing kits that might be installed on their platforms.