Joint Span Segmentation and Rhetorical Role Labeling with Data Augmentation for Legal Documents

Segmentation and rhetorical role labeling of legal judgements play a crucial role in retrieval and adjacent tasks, including case summarization, semantic search, and argument mining. Previous approaches have formulated this task either as independent classification or as sequence labeling of sentences. In this work, we reformulate the task at the span level: identifying spans of multiple consecutive sentences that share the same rhetorical role, and assigning each span its label via classification. We employ semi-Markov Conditional Random Fields (CRFs) to jointly learn span segmentation and span label assignment. We further explore three data augmentation strategies to mitigate data scarcity in the specialized domain of law, where individual documents tend to be very long and annotation cost is high. Our experiments demonstrate improved span-level prediction metrics with a semi-Markov CRF model over a CRF baseline. This benefit is contingent on the presence of multi-sentence spans in the document.

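To make the span-level formulation concrete, the following is a minimal sketch of semi-Markov Viterbi decoding, which chooses a segmentation and the span labels jointly. It is not the authors' implementation. The names `labels_to_spans`, `semi_markov_viterbi`, `span_score`, `transition`, and `max_len` are hypothetical. The toy scores stand in for what a trained model would produce, and the cap on span length is an assumption made to keep the search tractable.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, label)


def labels_to_spans(labels: Sequence[str]) -> List[Span]:
    """Group consecutive sentences sharing a label into maximal spans."""
    spans: List[Span] = []
    start = 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            spans.append((start, i, labels[start]))
            start = i
    return spans


def semi_markov_viterbi(
    n: int,
    labels: Sequence[str],
    span_score: Callable[[int, int, str], float],  # hypothetical span scorer
    transition: Callable[[str, str], float],       # hypothetical label-pair scorer
    max_len: int,                                  # assumed cap on span length
) -> List[Span]:
    """Return the highest-scoring segmentation of sentences [0, n) into labeled spans."""
    neg_inf = float("-inf")
    # best[i][y]: best score of a segmentation of [0, i) whose last span has label y
    best: List[Dict[str, float]] = [{y: neg_inf for y in labels} for _ in range(n + 1)]
    back: List[Dict[str, Optional[Tuple[int, Optional[str]]]]] = [
        {y: None for y in labels} for _ in range(n + 1)
    ]
    for i in range(1, n + 1):
        for y in labels:
            for d in range(1, min(max_len, i) + 1):
                s = i - d
                emit = span_score(s, i, y)
                if s == 0:
                    # First span of the document: no incoming transition.
                    if emit > best[i][y]:
                        best[i][y], back[i][y] = emit, (0, None)
                    continue
                for prev in labels:
                    if best[s][prev] == neg_inf:
                        continue
                    cand = best[s][prev] + transition(prev, y) + emit
                    if cand > best[i][y]:
                        best[i][y], back[i][y] = cand, (s, prev)
    if n == 0:
        return []
    # Backtrack from the best final label.
    y: Optional[str] = max(labels, key=lambda lab: best[n][lab])
    i, spans = n, []
    while i > 0:
        s, prev = back[i][y]
        spans.append((s, i, y))
        i, y = s, prev
    spans.reverse()
    return spans


# Toy example: per-sentence scores standing in for learned model outputs.
toy_scores = [
    {"FACT": 2.0, "ARG": 0.0},
    {"FACT": 2.0, "ARG": 0.0},
    {"FACT": 0.0, "ARG": 2.0},
    {"FACT": 0.0, "ARG": 2.0},
    {"FACT": 0.0, "ARG": 2.0},
]


def toy_span_score(start: int, end: int, label: str) -> float:
    # Sum of sentence scores; a real model would score the span as a unit.
    return sum(toy_scores[k][label] for k in range(start, end))


def toy_transition(prev: str, cur: str) -> float:
    # Penalize splitting a run of identical labels into separate spans.
    return -1.0 if prev == cur else 0.0


decoded = semi_markov_viterbi(5, ["FACT", "ARG"], toy_span_score, toy_transition, max_len=3)
print(decoded)
```

A sentence-level CRF scores one label per sentence, whereas the recursion above scores whole candidate spans, so span-level features can influence both where boundaries fall and which label is assigned. The search considers up to `max_len` span lengths at each position, which is the main extra cost compared with sentence-level decoding.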