Protecting privileged communications and data from inadvertent disclosure is a paramount task in US legal practice. Traditionally, counsel rely on keyword searching and manual review to identify privileged documents in a case. As data volumes increase, the cost of this approach becomes less and less defensible. Machine learning methods have been used to identify privileged documents. Given the generalizable nature of privilege across legal cases, we hypothesize that transfer learning can capitalize on knowledge learned from existing labeled data to identify privileged documents without requiring new labeled training data. In this paper, we study both traditional machine learning models and BERT-based deep learning models for privileged document classification in legal document review, and we examine the effectiveness of transfer learning for privilege models on three real-world datasets with privilege labels. Our results show that the BERT model outperforms the industry-standard logistic regression algorithm, and that transfer learning models can achieve decent performance on datasets in the same or closely related domains.