This paper presents RISC, an open-source Python package for data generation (https://github.com/GRAAL-Research/risc). RISC generates look-alike automobile insurance contracts, in French and English, based on the Quebec regulatory insurance form. Insurance contracts are 90 to 100 pages long and use legal and insurance-specific vocabulary that is complex for a layperson. Hence, they are a much more complex class of documents than those in traditional NLP corpora. We also introduce RISCBAC, a Realistic Insurance Synthetic Bilingual Automobile Contract dataset based on the mandatory Quebec car insurance contract. The dataset comprises 10,000 unannotated insurance contracts in French and English. RISCBAC enables NLP research on unsupervised automatic summarisation, question answering, text simplification, machine translation and more. Moreover, it can be further automatically annotated to serve as a dataset for supervised tasks such as named-entity recognition (NER).