The California Innocence Project (CIP), a clinical law school program that works to free wrongfully convicted prisoners, evaluates thousands of letters containing new requests for assistance and the corresponding case files. Processing and interpreting this volume of information is a significant challenge for CIP officials, and topic modeling techniques can help. In this paper, we apply the Non-negative Matrix Factorization (NMF) method and several of its variants to the important and previously unstudied data set compiled by CIP. We identify the underlying topics of existing case files and classify request files by crime type and case status (decision type). The results uncover the semantic structure of current case files and can give CIP officials a general understanding of newly received case files before further examination. We also provide an exposition of popular NMF variants with their experimental results and discuss the benefits and drawbacks of each variant in this real-world application.
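To make the idea of NMF-based topic modeling concrete, here is a minimal sketch. It is not the authors' pipeline: the abstract does not say which objective or solver is used, so this assumes the Frobenius-norm objective with the standard multiplicative update rules. The vocabulary and the document-term counts are invented purely for illustration and are not CIP data.

```python
import numpy as np

def nmf(X, rank, n_iter=200, eps=1e-10, seed=0):
    """Approximate a non-negative matrix X (documents x terms) as W @ H.

    Assumption: Frobenius-norm objective with multiplicative updates;
    the source does not state which NMF solver the authors use.
    """
    rng = np.random.default_rng(seed)
    n_docs, n_terms = X.shape
    W = rng.random((n_docs, rank)) + eps   # document-topic weights
    H = rng.random((rank, n_terms)) + eps  # topic-term weights
    for _ in range(n_iter):
        # Each update keeps entries non-negative and does not
        # increase the reconstruction error ||X - W @ H||_F.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Hypothetical toy corpus: 4 documents, 6 terms (made-up counts).
vocab = ["robbery", "weapon", "store", "dna", "appeal", "evidence"]
X = np.array([
    [3, 2, 2, 0, 0, 0],
    [2, 3, 1, 0, 0, 1],
    [0, 0, 0, 3, 2, 2],
    [0, 0, 0, 2, 3, 3],
], dtype=float)

W, H = nmf(X, rank=2)

# Each row of H is a topic; list its three highest-weighted terms.
for k, row in enumerate(H):
    top_terms = [vocab[j] for j in np.argsort(row)[::-1][:3]]
    print(f"topic {k}: {top_terms}")

# The largest entry in each row of W gives that document's dominant topic.
print("dominant topic per document:", W.argmax(axis=1))
```

In this reading, the rows of `H` play the role of topics and the rows of `W` give a low-dimensional representation of each document, which could in turn feed a classifier for labels such as crime type or case status.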