The growing complexity of legal cases has led to an increasing interest in legal information retrieval systems that can effectively satisfy user-specific information needs. However, such downstream systems typically require documents to be properly formatted and segmented, which is often done with relatively simple pre-processing steps that disregard the topical coherence of segments. Systems generally rely either on representations of individual sentences or paragraphs, which may lack crucial context, or on document-level representations, which are too long for meaningful search results. To address this issue, we propose a segmentation system that can predict the topical coherence of sequential text segments spanning several paragraphs, effectively segmenting a document and providing a more balanced representation for downstream applications. We build our model on top of popular transformer networks and formulate structural text segmentation as topical change detection, performing a series of independent classifications that allow for efficient fine-tuning on task-specific data. We crawl a novel dataset consisting of roughly 74,000 online Terms-of-Service documents, including hierarchical topic annotations, which we use for training. Results show that our proposed system significantly outperforms baselines and adapts well to structural peculiarities of legal documents. We release both data and trained models to the research community for future work: https://github.com/dennlinger/TopicalChange
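
To make the idea of segmentation as topical change detection concrete, the following is a minimal sketch of one plausible reading: each pair of consecutive paragraphs is classified independently as "same topic" or "topic change", and a new segment starts at every predicted change. The function names `segment` and `jaccard_same_topic`, the word-overlap heuristic, and the threshold of 0.2 are all illustrative assumptions; the heuristic is only a toy stand-in for a fine-tuned transformer classifier and is not the model described above.

```python
import re
from typing import Callable, List


def jaccard_same_topic(a: str, b: str, threshold: float = 0.2) -> bool:
    """Toy stand-in for a fine-tuned pairwise classifier (an assumption,
    not the paper's model): two paragraphs count as the same topic when
    their word sets overlap enough (Jaccard similarity >= threshold)."""
    tokens_a = set(re.findall(r"[a-z]+", a.lower()))
    tokens_b = set(re.findall(r"[a-z]+", b.lower()))
    if not tokens_a or not tokens_b:
        return False
    overlap = len(tokens_a & tokens_b) / len(tokens_a | tokens_b)
    return overlap >= threshold


def segment(
    paragraphs: List[str],
    same_topic: Callable[[str, str], bool] = jaccard_same_topic,
) -> List[List[str]]:
    """Group consecutive paragraphs into segments.

    Each adjacent pair is classified independently; a prediction of
    "not the same topic" closes the current segment and opens a new one.
    """
    segments: List[List[str]] = []
    for paragraph in paragraphs:
        if segments and same_topic(segments[-1][-1], paragraph):
            segments[-1].append(paragraph)
        else:
            segments.append([paragraph])
    return segments


# Hypothetical Terms-of-Service paragraphs, invented for illustration.
example_paragraphs = [
    "We collect your email address and usage data.",
    "Your usage data and email address are stored securely.",
    "You may terminate the agreement at any time.",
]

example_segments = segment(example_paragraphs)
for index, seg in enumerate(example_segments):
    print(index, seg)
```

Because every decision depends only on one pair of paragraphs, the classifier can be swapped out by passing any other callable as `same_topic`, for instance a wrapper around a trained sequence-pair model.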