New malware emerges at a rapid pace and often incorporates Domain Generation Algorithms (DGAs) to prevent the malware's connection to its command and control (C2) server from being blocked. Current state-of-the-art classifiers can separate benign from malicious domains (binary classification) and attribute them with high probability to the DGAs that generated them (multiclass classification). While binary classifiers can label domains of as-yet-unknown DGAs as malicious, multiclass classifiers can only assign domains to DGAs that are known at training time, which limits their ability to uncover new malware families. In this work, we perform a comprehensive study on the detection of new DGAs, which includes an evaluation of 59,690 classifiers. We examine four different approaches in 15 different configurations and propose a simple yet effective approach that combines a softmax classifier with regular expressions (regexes) to detect multiple unknown DGAs with high probability. At the same time, our approach retains state-of-the-art classification performance for known DGAs. Our evaluation is based on a leave-one-group-out cross-validation with a total of 94 DGA families. Because it uses the maximum number of known DGAs, our evaluation scenario is particularly difficult and close to real-world conditions. All of the approaches examined are privacy-preserving, since they operate without context and exclusively on the single domain to be classified. We round off our study with a thorough discussion of class-incremental learning strategies that can adapt an existing classifier to newly discovered classes.
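As a rough illustration of how a softmax classifier and regexes might be combined, the following minimal sketch accepts the classifier's top prediction only if the domain also matches a regex associated with the predicted family, and otherwise flags the domain as belonging to an unknown DGA. The family names, the patterns, the function name `classify_with_regex_check`, and the decision rule are assumptions made for illustration; they are not taken from the work itself, and the softmax probabilities are passed in directly rather than computed by a real model.

```python
import re
from typing import Dict, Sequence

# Hypothetical per-family patterns (illustrative only, not real DGA signatures).
CLASS_REGEXES: Dict[str, "re.Pattern[str]"] = {
    "family_a": re.compile(r"[a-z]{12}\.com"),
    "family_b": re.compile(r"[0-9a-f]{16}\.(net|org)"),
}


def classify_with_regex_check(
    domain: str,
    class_names: Sequence[str],
    softmax_probs: Sequence[float],
) -> str:
    """Return the predicted family, or "unknown" if the family's regex rejects the domain.

    Assumption: `softmax_probs` is the output of some multiclass classifier,
    aligned index-by-index with `class_names`.
    """
    # Pick the class with the highest softmax probability.
    best = max(range(len(class_names)), key=lambda i: softmax_probs[i])
    label = class_names[best]

    # Classes without a regex (e.g. a benign class) are accepted as predicted.
    pattern = CLASS_REGEXES.get(label)
    if pattern is not None and pattern.fullmatch(domain) is None:
        # The domain does not look like the predicted family: treat it as a new DGA.
        return "unknown"
    return label


if __name__ == "__main__":
    names = ["family_a", "family_b"]
    print(classify_with_regex_check("abcdefghijkl.com", names, [0.9, 0.1]))
    print(classify_with_regex_check("abc123.com", names, [0.9, 0.1]))
```

Like the approaches described above, this check uses only the single domain being classified and needs no additional context.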