We analyze a large corpus of police incident narratives to understand the spatial distribution of their topics. The motivation is that the narrative in each incident report contains fine-grained information that is richer than the category manually assigned by the police. Our approach is to split the corpus into topics using two unsupervised machine learning algorithms: Latent Dirichlet Allocation and Non-negative Matrix Factorization. We validate the performance of each learned topic model using model coherence. Then, using a k-nearest neighbors density ratio estimation (kNN-DRE) approach that we propose, we estimate the spatial density ratio per topic and use it for data discovery and analysis of each topic, allowing for insights into the described incidents at scale. We provide a qualitative assessment of each topic and highlight some key benefits of using our kNN-DRE model for estimating spatial trends.
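To make the idea of a spatial density ratio concrete, the following is a minimal sketch of a textbook k-nearest-neighbor density ratio in two dimensions. It is not the kNN-DRE method proposed here, whose details are not given in this abstract. It assumes the classic kNN density estimate k / (n · π · r_k²), where r_k is the distance from the query location to its k-th nearest neighbor, and divides the estimate for one topic's incidents by the estimate for all incidents. The function names, the choice of k, and the toy coordinates are all hypothetical.

```python
import math

def knn_radius(points, query, k):
    """Distance from `query` to its k-th nearest neighbor in `points`."""
    dists = sorted(math.dist(query, p) for p in points)
    # Guard against a zero radius when the query coincides with k or more points.
    return max(dists[k - 1], 1e-12)

def knn_density(points, query, k):
    """Classic 2-D kNN density estimate: k / (n * pi * r_k^2)."""
    r = knn_radius(points, query, k)
    return k / (len(points) * math.pi * r * r)

def knn_density_ratio(topic_points, all_points, query, k=3):
    """Topic density divided by overall density at `query` (assumed form)."""
    return knn_density(topic_points, query, k) / knn_density(all_points, query, k)

# Toy data: one topic clustered near the origin, plus a uniform background grid.
topic = [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.1), (0.0, -0.1), (5.0, 5.0)]
background = [(float(x), float(y)) for x in range(6) for y in range(6)]
everything = topic + background

hot = knn_density_ratio(topic, everything, (0.0, 0.0))
cold = knn_density_ratio(topic, everything, (3.0, 3.0))
print(hot > 1.0, cold < 1.0)
```

A ratio above 1 at a location suggests the topic is over-represented there relative to incidents overall, and a ratio below 1 suggests it is under-represented. In this simple form the estimate is sensitive to k and to duplicate coordinates, so it should be read as an illustration only.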