This article describes techniques employed in the production of a synthetic dataset of driver telematics, emulated from a similar real insurance dataset. The synthetic dataset has 100,000 policies that include observations of each driver's claims experience together with the associated classical risk variables and telematics-related variables. This work aims to produce a resource that can be used to advance models for assessing risks in usage-based insurance. It follows a three-stage process using machine learning algorithms. The first stage simulates values for the number of claims, cast as multiple binary classifications, using feedforward neural networks. The second stage simulates values for the aggregated amount of claims as a regression, again using feedforward neural networks, with the number of claims included in the set of feature variables. In the final stage, a synthetic portfolio over the space of feature variables is generated using an extended $\texttt{SMOTE}$ algorithm. The resulting dataset is evaluated by comparing the synthetic and real datasets when Poisson and gamma regression models are fitted to the respective data. Other visualizations and data summaries produce remarkably similar statistics between the two datasets. We hope that researchers interested in obtaining telematics datasets to calibrate models or learning algorithms will find our work valuable.
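To make the first two stages concrete, the following is a minimal sketch, assuming a pandas DataFrame of real policies with placeholder column names ($\texttt{num\_claims}$, $\texttt{agg\_claims}$, and the entries of $\texttt{FEATURES}$ are illustrative assumptions, not the paper's variables). Scikit-learn's $\texttt{MLPClassifier}$ and $\texttt{MLPRegressor}$ stand in for the paper's feedforward neural networks, and all hyperparameters are arbitrary.

```python
# Sketch of stages one and two: claim counts via sequential binary
# classifications, then aggregate claim amounts via regression.
import numpy as np
import pandas as pd
from sklearn.neural_network import MLPClassifier, MLPRegressor

FEATURES = ["duration", "credit_score", "annual_miles"]  # placeholder names

def simulate_claims(real: pd.DataFrame, synth_X: pd.DataFrame, max_count: int = 3):
    rng = np.random.default_rng(0)

    # Stage 1: treat the claim count as a sequence of binary events
    # ("at least k claims?" for k = 1..max_count), fitting each classifier
    # on the policies that reached k-1 claims and sampling from the
    # predicted probabilities.
    counts = np.zeros(len(synth_X), dtype=int)
    for k in range(1, max_count + 1):
        sub = real[real["num_claims"] >= k - 1]
        y_k = (sub["num_claims"] >= k).astype(int)
        if y_k.nunique() < 2:  # no policies reach k claims; stop early
            break
        clf = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500,
                            random_state=k).fit(sub[FEATURES], y_k)
        p_k = clf.predict_proba(synth_X[FEATURES])[:, 1]
        alive = counts == k - 1  # only policies that already have k-1 claims
        counts += (alive & (rng.random(len(synth_X)) < p_k)).astype(int)

    # Stage 2: regress aggregate claim amounts on the features plus the
    # claim count, fitting only on policies that actually had claims.
    with_claims = real["num_claims"] > 0
    reg = MLPRegressor(hidden_layer_sizes=(32, 16), max_iter=500, random_state=0)
    reg.fit(real.loc[with_claims, FEATURES + ["num_claims"]],
            real.loc[with_claims, "agg_claims"])

    X2 = synth_X[FEATURES].assign(num_claims=counts)
    amounts = np.where(counts > 0, reg.predict(X2), 0.0)
    return counts, amounts
```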
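The final stage rests on $\texttt{SMOTE}$-style interpolation in feature space. The sketch below shows only the core resampling idea, creating new records on line segments between a real record and one of its nearest neighbors; the paper's extension of $\texttt{SMOTE}$ is not reproduced here, and the function name and parameters are assumptions.

```python
# Core SMOTE interpolation over a numeric feature matrix X
# (one row per real policy).
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sample(X: np.ndarray, n_new: int, k: int = 5, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    # indices of the k nearest neighbors of every row (column 0 is the row itself)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)

    base = rng.integers(0, len(X), size=n_new)            # random anchor rows
    nbr = idx[base, rng.integers(1, k + 1, size=n_new)]   # random neighbor of each anchor
    lam = rng.random((n_new, 1))                          # interpolation weights in (0, 1)
    return X[base] + lam * (X[nbr] - X[base])

# e.g. synth_X = smote_sample(real[FEATURES].to_numpy(), n_new=100_000)
```

Plain $\texttt{SMOTE}$ handles only continuous features; categorical risk variables would need the kind of extension the paper develops.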