Big data methods are becoming an important tool for tax fraud detection around the world. Unsupervised learning is the dominant framework because the corresponding datasets lack labels and ground truth, although unsupervised methods suffer from low interpretability. This paper presents HUNOD, a novel hybrid unsupervised outlier detection method for tax evasion risk management. In contrast to previous methods proposed in the literature, HUNOD combines two outlier detection approaches based on two different machine learning designs (i.e., clustering and representational learning) to detect and internally validate outliers in a given tax dataset. The method allows its users to incorporate relevant domain knowledge into both constituent outlier detection approaches in order to detect outliers relevant to a given economic context. The detected outliers are made interpretable by training explainable-by-design surrogate models on the results of the unsupervised outlier detection methods. HUNOD is evaluated experimentally on two datasets derived from the database of individual personal income tax declarations collected by the Tax Administration of Serbia. The results show that between 90% and 98% of the detected outliers are internally validated, depending on the clustering configuration and the regularization mechanisms employed for representational learning.
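The pipeline summarized above (two unsupervised detectors, internal validation, and an explainable surrogate) can be illustrated with a minimal sketch. It is not the HUNOD implementation, and every name in it is hypothetical. It rests on several assumptions: a plain k-means distance score stands in for the clustering component, the reconstruction error of a linear encoder/decoder (PCA) stands in for the representational learning component, a one-rule decision stump stands in for the explainable-by-design surrogate, internal validation is read as agreement between the two detectors, and the data are synthetic rather than tax records.

```python
import numpy as np

def kmeans_scores(X, k=2, n_iter=50, seed=0):
    """Clustering view: distance of each row to its nearest k-means centroid."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # keep the old centroid if a cluster empties
                centroids[j] = X[labels == j].mean(axis=0)
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dists.min(axis=1)

def reconstruction_scores(X, n_components=1):
    """Representation view: reconstruction error of a linear encoder/decoder (PCA)."""
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    W = vt[:n_components]  # rows span the low-dimensional code
    return np.linalg.norm(Xc - Xc @ W.T @ W, axis=1)

def flag_top(scores, n_flag):
    """Mark the n_flag rows with the highest outlier scores (assumes n_flag >= 1)."""
    flagged = np.zeros(len(scores), dtype=bool)
    flagged[np.argsort(scores)[-n_flag:]] = True
    return flagged

def fit_stump(X, y):
    """Surrogate: the single rule 'feature > threshold' that best reproduces y."""
    best_acc, best_feature, best_threshold = -1.0, 0, 0.0
    for f in range(X.shape[1]):
        for t in np.unique(X[:, f]):
            acc = float(np.mean((X[:, f] > t) == y))
            if acc > best_acc:
                best_acc, best_feature, best_threshold = acc, f, float(t)
    return best_feature, best_threshold, best_acc

def make_synthetic(seed=0):
    """Synthetic data only: 200 rows near a line plus 5 planted off-line rows."""
    rng = np.random.default_rng(seed)
    t = rng.uniform(-1.0, 1.0, 200)
    inliers = np.column_stack([t, 2.0 * t + rng.normal(0.0, 0.05, 200)])
    planted = np.array([[3.0, -1.5], [3.5, -1.75], [4.0, -2.0],
                        [4.5, -2.25], [5.0, -2.5]])
    return np.vstack([inliers, planted])

X = make_synthetic()
by_cluster = flag_top(kmeans_scores(X, k=2), n_flag=5)
by_repr = flag_top(reconstruction_scores(X), n_flag=5)
validated = by_cluster & by_repr              # outliers both detectors agree on
share = validated.sum() / by_cluster.sum()    # share of clustering outliers confirmed
feature, threshold, fidelity = fit_stump(X, validated)
print(f"internally validated share: {share:.2f}")
print(f"surrogate rule: feature {feature} > {threshold:.2f} (fidelity {fidelity:.2f})")
```

In this sketch the share of clustering outliers confirmed by the second detector plays the role of the internal validation rate, and the stump's rule is the human-readable explanation. Domain knowledge could enter through feature selection or weighting before either detector is run, although how the method itself incorporates it is not specified here.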