With growing credit card transaction volumes, fraud rates are also rising, along with the overhead costs institutions incur to combat fraud and compensate victims. The adoption of machine learning in the financial sector enables more effective protection against fraud and other economic crime. Suitably trained machine learning classifiers support proactive fraud detection, improving stakeholder trust and robustness against illicit transactions. However, the design of machine learning based fraud detection algorithms has been challenging and slow, owing to the massively imbalanced nature of fraud data and the difficulty of identifying frauds accurately and completely enough to create a gold-standard ground truth. Furthermore, there are no benchmarks or standard classifier evaluation metrics to measure and identify better-performing classifiers, leaving researchers in the dark. In this work, we develop a theoretical foundation to model the human annotation errors and extreme imbalance typical of real-world fraud detection data sets. By conducting empirical experiments on a hypothetical classifier, with a synthetic data distribution approximating a popular real-world credit card fraud data set, we simulate human annotation errors and extreme imbalance to observe the behavior of popular machine learning classifier evaluation metrics. We demonstrate that the F1 score combined with the g-mean, applied in that specific order, is the best evaluation metric for typical imbalanced fraud detection model classification.
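As context for the two metrics named above, the following is a minimal sketch (not taken from the paper) of how the F1 score and g-mean are computed for a hypothetical binary fraud classifier; the toy labels and predictions are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import f1_score, confusion_matrix

# Hypothetical ground truth and predictions; 1 = fraud (minority class).
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 0])

# F1: harmonic mean of precision and recall on the positive (fraud) class.
f1 = f1_score(y_true, y_pred)

# G-mean: geometric mean of sensitivity (recall on the fraud class) and
# specificity (recall on the legitimate class), which penalizes classifiers
# that ignore the minority class under extreme imbalance.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
g_mean = np.sqrt(sensitivity * specificity)

print(f"F1 = {f1:.3f}, G-mean = {g_mean:.3f}")
```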