Unsupervised Anomaly Detection for Auditing Data and Impact of Categorical Encodings

In this paper, we introduce the Vehicle Claims dataset, which consists of fraudulent insurance claims for automotive repairs. The data belongs to the broader category of auditing data, which also includes journals and network intrusion data. Insurance claim data differ markedly from other auditing data (such as network intrusion data) in their high number of categorical attributes. We tackle the common problem of missing benchmark datasets for anomaly detection: datasets are mostly confidential, and the public tabular datasets do not contain relevant and sufficient categorical attributes. We therefore create a large dataset for this purpose, referred to as the Vehicle Claims (VC) dataset, and evaluate shallow and deep learning methods on it. Introducing categorical attributes raises the challenge of encoding them for a large dataset. Because One Hot encoding of a high-cardinality dataset invokes the "curse of dimensionality", we experiment with GEL encoding and an embedding layer for representing categorical attributes. Our work compares competitive learning, reconstruction-error, density estimation, and contrastive learning approaches across Label, One Hot, and GEL encoding and an embedding layer for handling categorical values.
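The dimensionality concern with One Hot encoding can be illustrated with a minimal sketch. It is not taken from the paper: the column names (`maker`, `color`) and their values are hypothetical, and the helper functions are written only for illustration. Label encoding keeps one integer column per attribute, whereas One Hot encoding adds one column per distinct category, so the encoded width grows with the cardinality of each attribute.

```python
from typing import Dict, List, Sequence


def label_encode(values: Sequence[str]) -> List[int]:
    """Map each distinct category to an integer code, in order of first appearance."""
    codes: Dict[str, int] = {}
    return [codes.setdefault(v, len(codes)) for v in values]


def one_hot_encode(values: Sequence[str]) -> List[List[int]]:
    """Return one 0/1 column per distinct category (categories sorted alphabetically)."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        rows.append(row)
    return rows


# Hypothetical toy columns; the real dataset's attributes are not reproduced here.
columns = {
    "maker": ["audi", "bmw", "audi", "ford"],
    "color": ["red", "blue", "red", "red"],
}

# Label encoding: the width equals the number of categorical attributes.
label_width = len(columns)

# One Hot encoding: the width equals the sum of the attribute cardinalities.
one_hot_width = sum(len(one_hot_encode(col)[0]) for col in columns.values())

print(label_width, one_hot_width)
```

With only a few categories the difference is small, but an attribute with thousands of distinct values would add thousands of sparse columns under One Hot encoding, which is the situation that motivates the alternative encodings compared in the abstract.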

