Credit cards play a rapidly growing role in modern economies. Their popularity and ubiquity have created fertile ground for fraud, aided by cross-border reach and instantaneous confirmation. As transaction volumes grow, so do the fraud rate and the true cost of each dollar lost to fraud. The volume of transactions, the uniqueness of individual frauds, and the ingenuity of fraudsters are the main challenges in detecting fraud. The advent of machine learning, artificial intelligence, and big data has provided new tools in the fight against fraud. Given past transactions, a machine learning algorithm can 'learn' highly complex characteristics in order to identify fraud in real time, surpassing the best human investigators. However, the development of fraud detection algorithms has been challenging and slow due to the massively imbalanced nature of fraud data, the absence of benchmarks and standard evaluation metrics for identifying better-performing classifiers, the lack of sharing and disclosure of research findings, and the difficulty of accessing confidential transaction data for research. This work investigates the properties of typical massively imbalanced fraud data sets, their availability, and their suitability for research use, while exploring the widely varying nature of fraud distributions. Furthermore, we show how human annotation errors compound with machine classification errors. We also carry out experiments to determine the effect of PCA obfuscation (as a means of disseminating sensitive transaction data for research and machine learning) on the algorithmic performance of classifiers, and show that while PCA does not significantly degrade performance, care should be taken to choose an appropriate number of principal components (dimensions) to avoid overfitting.
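To illustrate why massive class imbalance makes the choice of evaluation metric matter, the following minimal Python sketch scores a trivial classifier that labels every transaction as legitimate. The data set size (10,000 transactions) and the fraud rate (0.2%) are hypothetical values chosen only for illustration; they are not results from this work, and the helper names are ours.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives (1 = fraud, 0 = legitimate)."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def metrics(y_true, y_pred):
    """Return accuracy, precision and recall for the fraud class."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    # Guard against division by zero when no positives are predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}


# Hypothetical, illustrative data: 20 frauds among 10,000 transactions (0.2%).
y_true = [1] * 20 + [0] * 9980

# A trivial "classifier" that never flags anything as fraud.
always_legit = [0] * len(y_true)

result = metrics(y_true, always_legit)
print(result)
```

Under these assumed numbers the trivial classifier is right on 9,980 of 10,000 transactions, yet its recall on the fraud class is zero: it detects no fraud at all. Accuracy alone therefore cannot rank classifiers on such data, which is one reason standard, imbalance-aware metrics are needed.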