Synthetic Dataset Generation of Driver Telematics

This article describes the techniques used to produce a synthetic dataset of driver telematics emulated from a similar real insurance dataset. The synthetic dataset has 100,000 policies that include observations on drivers' claims experience together with the associated classical risk variables and telematics-related variables. The aim is to produce a resource that can be used to advance models for assessing risk in usage-based insurance. The work follows a three-stage process using machine learning algorithms. The first stage simulates the number of claims as multiple binary classifications using feedforward neural networks. The second stage simulates the aggregated amount of claims as a regression using feedforward neural networks, with the number of claims included in the set of feature variables. In the final stage, a synthetic portfolio of the space of feature variables is generated by applying an extended $\texttt{SMOTE}$ algorithm. The resulting dataset is evaluated by comparing the synthetic and real datasets when Poisson and gamma regression models are fitted to the respective data. Other visualizations and data summaries produce remarkably similar statistics between the two datasets. We hope that researchers interested in obtaining telematics datasets to calibrate models or learning algorithms will find our work valuable.
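To make the final stage more concrete, the following is a minimal sketch of plain SMOTE-style interpolation, in which each synthetic row is a random point on the segment between a real row and one of its nearest neighbours. It is not the extended variant the abstract refers to, whose details are not given here. The function name `smote_like_sample`, the parameter `k`, and the toy data are assumptions made for illustration, and all features are assumed to be continuous.

```python
import numpy as np


def smote_like_sample(X, n_samples, k=5, seed=0):
    """Draw synthetic rows by interpolating between nearest neighbours.

    Illustrative only: basic SMOTE-style interpolation on continuous
    features, not the extended algorithm used in the article.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    k = min(k, n - 1)  # cannot have more neighbours than other rows

    # Pairwise squared Euclidean distances (fine for small toy data).
    diff = X[:, None, :] - X[None, :, :]
    d2 = (diff ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)  # a row is never its own neighbour

    # Indices of the k nearest neighbours of every row.
    neighbours = np.argsort(d2, axis=1)[:, :k]

    # Pick a base row and one of its neighbours for each synthetic row.
    base = rng.integers(0, n, size=n_samples)
    pick = rng.integers(0, k, size=n_samples)
    mate = neighbours[base, pick]

    # Interpolate with a uniform weight in [0, 1).
    u = rng.random((n_samples, 1))
    return X[base] + u * (X[mate] - X[base])


if __name__ == "__main__":
    # Hypothetical stand-in for a real feature matrix.
    real = np.random.default_rng(1).normal(size=(50, 3))
    synthetic = smote_like_sample(real, n_samples=200)
    print(synthetic.shape)
```

Because every synthetic row is a convex combination of two real rows, each feature stays within the range observed in the real data. That property suits continuous variables, but categorical or integer-valued features would need separate handling.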

