A Data Science Approach to Risk Assessment for Automobile Insurance Policies

To determine a suitable premium for an automobile insurance policy, one needs to take three factors into account: the risk associated with the drivers and cars on the policy, the operational costs of managing the policy, and the desired profit margin. The premium should then be some function of these three values. We focus on risk assessment using a Data Science approach. Instead of using the traditional frequency and severity metrics, we predict the total claims that a new customer will make, using historical data from current and past policies. Given multiple features of the policy (age and gender of drivers, value of car, previous accidents, etc.), one can try to provide personalized insurance policies based specifically on these features, as follows. We compute the claims made per year for every past and current policy with identical features and then average these claim rates. Unfortunately, there may not be enough samples to obtain a robust average. We can instead include policies that are "similar" in order to obtain enough samples. We therefore face a trade-off between personalization (using only closely similar policies) and robustness (extending the domain far enough to capture sufficient samples). This is known as the Bias-Variance Trade-off. We model this problem, determine the optimal trade-off between the two (i.e. the balance that provides the highest prediction accuracy), and apply it to the claim rate prediction problem. We demonstrate our approach using real data.
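The trade-off described above can be illustrated with a minimal sketch. The abstract does not specify the similarity measure or the selection procedure, so everything here is an assumption made for illustration: similarity is taken to be Euclidean distance between numerically encoded policy features, the neighborhood size `k` stands in for "how far the domain is extended", and leave-one-out validation picks the `k` with the lowest squared prediction error. The function names and the synthetic data generator are hypothetical and do not come from the paper.

```python
import numpy as np


def knn_claim_rate(X_train, y_train, x_new, k):
    """Average annual claim rate of the k historical policies closest to x_new.

    Assumes features are numeric and already on comparable scales;
    Euclidean distance is an illustrative choice of similarity.
    """
    dists = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(dists)[:k]
    return float(y_train[nearest].mean())


def loo_error(X, y, k):
    """Leave-one-out mean squared error of the k-neighbor average."""
    n = len(y)
    sq_errors = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i  # hold out policy i
        pred = knn_claim_rate(X[keep], y[keep], X[i], k)
        sq_errors[i] = (pred - y[i]) ** 2
    return float(sq_errors.mean())


def choose_k(X, y, candidates):
    """Return the candidate k with the lowest leave-one-out error.

    Small k: highly personalized but noisy (high variance).
    Large k: robust but less personalized (high bias).
    """
    errors = {k: loo_error(X, y, k) for k in candidates}
    best = min(errors, key=errors.get)
    return best, errors


def make_synthetic_policies(n=300, seed=0):
    """Synthetic stand-in data (NOT real insurance data).

    Two features in [0, 1]; the claim rate rises with the first
    feature and is observed with heavy noise.
    """
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, size=(n, 2))
    y = 500.0 + 1500.0 * X[:, 0] + rng.normal(0.0, 400.0, size=n)
    return X, y


if __name__ == "__main__":
    X, y = make_synthetic_policies()
    best_k, errors = choose_k(X, y, [1, 5, 20, 80, 299])
    for k, err in errors.items():
        print(f"k={k:>3}  leave-one-out MSE={err:,.0f}")
    print("selected k:", best_k)
```

In this sketch, `k=1` corresponds to the most personalized estimate and `k=299` to averaging over every other policy in the sample; the selected value lies wherever the held-out error is smallest for the data at hand.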
