Machine Learning for Detecting Data Exfiltration: A Review

Context: Research at the intersection of cybersecurity, Machine Learning (ML), and Software Engineering (SE) has recently made significant progress in proposing countermeasures for detecting sophisticated data exfiltration attacks. It is important to systematically review and synthesize these ML-based data exfiltration countermeasures to build a body of knowledge on this topic. Objective: This paper systematically reviews ML-based data exfiltration countermeasures to identify and classify the ML approaches, feature engineering techniques, evaluation datasets, and performance metrics used for these countermeasures. The review also aims to identify gaps in research on ML-based data exfiltration countermeasures. Method: We used a Systematic Literature Review (SLR) method to select and review 92 papers. Results: The review has enabled us to (a) classify the ML approaches used in the countermeasures into data-driven and behaviour-driven approaches; (b) categorize features into six types: behavioural, content-based, statistical, syntactical, spatial, and temporal; (c) classify the evaluation datasets into simulated, synthesized, and real datasets; and (d) identify 11 performance measures used by these studies. Conclusion: We conclude that (i) the integration of data-driven and behaviour-driven approaches should be explored; (ii) high-quality, large-scale evaluation datasets need to be developed; (iii) incremental ML model training should be incorporated into countermeasures; (iv) resilience to adversarial learning should be considered and explored during the development of countermeasures to avoid poisoning attacks; and (v) the use of automated feature engineering should be encouraged for efficiently detecting data exfiltration attacks.
