Data labels in the security field are frequently noisy, limited, or biased toward a subset of the population. As a result, commonplace evaluation methods such as accuracy, precision, and recall, or analysis of performance curves computed from labeled datasets, do not provide sufficient confidence in the real-world performance of a machine learning (ML) model. This has slowed the adoption of machine learning in the field. In industry today, we rely on domain expertise and lengthy manual evaluation to build this confidence before shipping a new model for security applications. In this paper, we introduce Firenze, a novel framework for comparative evaluation of ML models' performance using domain expertise encoded into scalable functions called markers. We show that markers, computed and combined over select subsets of samples called regions of interest, can provide a robust estimate of the models' real-world performance. Critically, we use statistical hypothesis testing to ensure that observed differences, and therefore the conclusions emerging from our framework, are more prominent than what is observable from noise alone. Using simulations and two real-world datasets for malware and domain-name-service reputation detection, we illustrate our approach's effectiveness, limitations, and insights. Taken together, we propose Firenze as a resource for fast, interpretable, and collaborative model development and evaluation by mixed teams of researchers, domain experts, and business owners.
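To make the central idea concrete, the following is a minimal, hypothetical sketch of comparing two candidate models with a marker evaluated over a region of interest, followed by a significance check. None of this code is taken from the paper: the synthetic model scores, the choice of marker (mean maliciousness score on an assumed-benign allowlist region), and the paired permutation test are all illustrative assumptions standing in for the framework's markers and hypothesis tests.

```python
"""Illustrative sketch only: marker-based comparison of two models
over a region of interest, with a paired permutation test to check
that the observed gap exceeds noise. All names and choices here are
assumptions, not the authors' implementation."""
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical maliciousness scores (higher = more malicious) from two
# candidate models on the same unlabeled samples.
n = 500
score_a = rng.beta(2, 5, size=n)  # candidate model A
score_b = rng.beta(2, 4, size=n)  # candidate model B

# Region of interest: samples a domain expert trusts to be benign
# (e.g., an internal allowlist). Here, simply a fixed subset of indices.
roi = np.arange(0, 100)

def marker_mean_benign_score(scores, region):
    """Marker: average maliciousness assigned to assumed-benign samples
    (lower is better)."""
    return scores[region].mean()

m_a = marker_mean_benign_score(score_a, roi)
m_b = marker_mean_benign_score(score_b, roi)
observed_diff = m_a - m_b

def permutation_pvalue(a, b, region, n_perm=5000):
    """Paired permutation test: under the null hypothesis the two models
    are interchangeable on this region, so swapping their scores per
    sample should not change the marker gap."""
    a_roi, b_roi = a[region], b[region]
    obs = a_roi.mean() - b_roi.mean()
    diffs = np.empty(n_perm)
    for i in range(n_perm):
        swap = rng.random(a_roi.size) < 0.5
        perm_a = np.where(swap, b_roi, a_roi)
        perm_b = np.where(swap, a_roi, b_roi)
        diffs[i] = perm_a.mean() - perm_b.mean()
    return np.mean(np.abs(diffs) >= np.abs(obs))

p = permutation_pvalue(score_a, score_b, roi)
print(f"marker(A)={m_a:.3f}  marker(B)={m_b:.3f}  "
      f"diff={observed_diff:+.3f}  p={p:.3f}")
```

In this sketch, a small p-value would suggest that the marker gap between the two candidates is larger than what score noise alone would produce, which mirrors the role hypothesis testing plays in the framework described above.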