Learning Bill Similarity with Annotated and Augmented Corpora of Bills

Bill writing is a critical element of representative democracy. However, it is often overlooked that most legislative bills are derived, or even directly copied, from other bills. Despite the significance of bill-to-bill linkages for understanding the legislative process, existing approaches fail to address semantic similarities across bills, let alone the reordering and paraphrasing that are prevalent in legal document writing. In this paper, we overcome these limitations by proposing a 5-class classification task that closely reflects the nature of the bill generation process. To support this task, we construct a human-labeled dataset of 4,721 bill-to-bill relationships at the subsection level and release this annotated dataset to the research community. To augment the dataset, we generate synthetic data with varying degrees of similarity, mimicking the complex bill writing process. We use BERT variants and apply multi-stage training, sequentially fine-tuning our models on the synthetic and then the human-labeled datasets. We find that predictive performance improves significantly when training with both human-labeled and synthetic data. Finally, we apply our trained model to infer section- and bill-level similarities. Our analysis shows that the proposed methodology successfully captures the similarities across legal documents at various levels of aggregation.
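
As a rough illustration of what generating synthetic pairs with varying degrees of similarity could look like, the sketch below derives variants of a subsection by reordering and dropping sentences. It is only a minimal sketch under stated assumptions: the abstract does not describe the actual augmentation procedure or define the five classes, so the helper names (`split_sentences`, `make_synthetic_pair`), the 0 to 4 level scheme, and the naive sentence splitter are all hypothetical.

```python
import random


def split_sentences(text):
    # Naive splitter on periods; an assumption for illustration only.
    return [s.strip() for s in text.split(".") if s.strip()]


def make_synthetic_pair(text, level, rng):
    """Return (original, variant); a higher level (0-4) means a closer variant.

    The level scheme is hypothetical and is not the paper's label set.
    """
    sents = split_sentences(text)
    if not sents:
        raise ValueError("text must contain at least one sentence")
    if level == 4:
        # Identical content in the original order.
        variant = list(sents)
    elif level == 3:
        # Same sentences, shuffled to mimic reordering.
        variant = list(sents)
        rng.shuffle(variant)
    elif level == 2:
        # Shuffled, with one sentence dropped when more than one exists.
        variant = list(sents)
        rng.shuffle(variant)
        if len(variant) > 1:
            variant = variant[:-1]
    elif level == 1:
        # Only a single sentence from the source survives.
        variant = [rng.choice(sents)]
    elif level == 0:
        # Placeholder standing in for an unrelated subsection.
        variant = ["This placeholder text is unrelated to the source"]
    else:
        raise ValueError("level must be between 0 and 4")
    return text, ". ".join(variant) + "."


if __name__ == "__main__":
    rng = random.Random(0)  # seeded so repeated runs give the same pairs
    source = (
        "The Secretary shall issue rules. "
        "The rules take effect in 90 days. "
        "Funds are authorized for 2025."
    )
    for level in range(5):
        _, variant = make_synthetic_pair(source, level, rng)
        print(level, variant)
```

Pairs produced this way could serve as the first fine-tuning stage before the human-labeled pairs, in the spirit of the multi-stage training described above. A realistic pipeline would also need paraphrasing, which this sketch does not attempt.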
