Finding Phish in a Haystack: A Pipeline for Phishing Classification on Certificate Transparency Logs

Popular phishing prevention techniques currently rely mainly on reactive blocklists, which leave a "window of opportunity" for attackers during which victims are unprotected. One possible way to shorten this window is to detect phishing attacks earlier, while the website is still being prepared, by monitoring Certificate Transparency (CT) logs. Previous attempts to use CT log data for phishing classification exist; however, they lack evaluations on actual CT log data. In this paper, we present a pipeline that facilitates such evaluations by addressing a number of problems that arise when working with CT log data. The pipeline covers dataset creation, training, and classification of past or live CT logs. Its modular structure makes it easy to exchange classifiers or verification sources, which supports ground truth labeling efforts and classifier comparisons. We test the pipeline on a number of new and existing classifiers and find that there is general potential to improve classifiers for this scenario in the future. We publish the source code of the pipeline and the datasets used along with this paper (https://gitlab.com/rwth-itsec/ctl-pipeline), thus making future research in this direction more accessible.
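To illustrate what an exchangeable classifier in a modular pipeline could look like, here is a minimal Python sketch. It is not taken from the published pipeline: the names `Classifier`, `KeywordClassifier`, `Verdict`, and `classify_entries`, the keyword list, and the threshold are all hypothetical and chosen only for illustration. The sketch assumes that domain names have already been extracted from CT log entries.

```python
from dataclasses import dataclass
from typing import Iterable, List, Protocol, Tuple


class Classifier(Protocol):
    """Hypothetical interface: return a score in [0, 1], higher means more phish-like."""

    def score(self, domain: str) -> float: ...


@dataclass
class KeywordClassifier:
    """Toy stand-in classifier (an assumption, not one of the paper's classifiers)."""

    keywords: Tuple[str, ...] = ("login", "verify", "secure")

    def score(self, domain: str) -> float:
        # Count how many suspicious keywords appear in the domain name.
        hits = sum(1 for k in self.keywords if k in domain.lower())
        # Two or more keyword hits saturate the score at 1.0.
        return min(1.0, hits / 2)


@dataclass
class Verdict:
    domain: str
    score: float
    flagged: bool


def classify_entries(
    domains: Iterable[str], classifier: Classifier, threshold: float = 0.5
) -> List[Verdict]:
    """Run any object that satisfies the Classifier interface over domain names."""
    verdicts = []
    for domain in domains:
        s = classifier.score(domain)
        verdicts.append(Verdict(domain, s, s >= threshold))
    return verdicts


if __name__ == "__main__":
    sample = ["secure-login.example.com", "example.org"]
    for verdict in classify_entries(sample, KeywordClassifier()):
        print(verdict)
```

Because `classify_entries` depends only on the `score` method, a different classifier can be passed in without changing the surrounding code. A verification source used for ground truth labeling could be made exchangeable in the same way.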
