In this paper, we present our approaches for the case law retrieval task and the legal case entailment task of the Competition on Legal Information Extraction/Entailment (COLIEE) 2021. Since first-stage retrieval methods combined with neural re-rankers based on contextualized language models such as BERT have achieved substantial performance improvements for information retrieval in the web and news domains, we evaluate these methods for the legal domain. A distinct characteristic of legal case retrieval is that both the query case and the case descriptions in the corpus tend to be long documents that exceed the input length of BERT. We address this challenge by combining lexical and dense retrieval methods at the paragraph level of the cases for the first-stage retrieval. We demonstrate that paragraph-level retrieval outperforms document-level retrieval. Furthermore, our experiments suggest that dense retrieval outperforms lexical retrieval. For re-ranking, we address the problem of long documents by summarizing the cases and fine-tuning a BERT-based re-ranker on the summaries. Overall, our best results were obtained with a combination of BM25 and dense passage retrieval using domain-specific embeddings.
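The paragraph-level fusion of lexical and dense scores described above can be sketched as follows. This is a minimal illustration, not the paper's exact method: the min-max normalization, the interpolation weight `alpha`, and the max-over-paragraphs document aggregation are all illustrative assumptions.

```python
from collections import defaultdict

def min_max_normalize(scores):
    """Map raw retrieval scores (dict: paragraph_id -> score) to [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {p: 0.0 for p in scores}
    return {p: (s - lo) / (hi - lo) for p, s in scores.items()}

def fuse_and_aggregate(bm25_scores, dense_scores, doc_of, alpha=0.5):
    """Interpolate normalized lexical and dense paragraph scores, then
    score each document by its best-scoring paragraph (an assumption)."""
    bm25 = min_max_normalize(bm25_scores)
    dense = min_max_normalize(dense_scores)
    doc_scores = defaultdict(float)
    for p in bm25:
        fused = alpha * bm25[p] + (1 - alpha) * dense.get(p, 0.0)
        doc_scores[doc_of[p]] = max(doc_scores[doc_of[p]], fused)
    return sorted(doc_scores.items(), key=lambda x: x[1], reverse=True)

# Toy example: two candidate cases, two paragraphs each.
bm25_scores = {"d1_p1": 12.0, "d1_p2": 3.0, "d2_p1": 8.0, "d2_p2": 1.0}
dense_scores = {"d1_p1": 0.82, "d1_p2": 0.40, "d2_p1": 0.91, "d2_p2": 0.30}
doc_of = {p: p.split("_")[0] for p in bm25_scores}

ranking = fuse_and_aggregate(bm25_scores, dense_scores, doc_of)
```

In practice the BM25 scores would come from a lexical index over case paragraphs and the dense scores from cosine similarity of paragraph embeddings; only the fusion and aggregation step is shown here.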