Major scandals in corporate history have underscored the need for regulatory compliance, under which organizations must ensure that their controls (processes) comply with relevant laws, regulations, and policies. However, keeping track of constantly changing legislation is difficult, so organizations are increasingly adopting Regulatory Technology (RegTech) to facilitate the process. To this end, we introduce regulatory information retrieval (REG-IR), an application of document-to-document information retrieval (DOC2DOC IR) in which the query is an entire document, making the task more challenging than traditional IR, where queries are short. Furthermore, we compile and release two datasets based on the relationships between EU directives and UK legislation. We experiment on these datasets using a typical two-step pipeline comprising a pre-fetcher and a neural re-ranker. Experimenting with various pre-fetchers, from BM25 to k-nearest neighbors over representations from several BERT models, we show that fine-tuning a BERT model on an in-domain classification task produces the best representations for IR. We also show that neural re-rankers underperform due to contradicting supervision, i.e., similar query-document pairs with opposite labels; as a result, they are biased towards the pre-fetcher's score. Interestingly, applying a date filter further improves performance, showcasing the importance of the time dimension.
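The two-step pipeline described above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation. It assumes a whitespace tokenizer, a textbook BM25 formula with common default parameters, an optional `reranker` callable standing in for the neural re-ranker, and a date filter that keeps only documents published on or after the query document. The direction of the filter, the names `Doc`, `bm25_scores`, and `retrieve`, and the toy corpus are all assumptions made for illustration.

```python
import math
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass
class Doc:
    doc_id: str
    text: str
    published: date


def tokenize(text):
    # Assumption: lowercase whitespace tokenization is enough for a sketch.
    return text.lower().split()


def bm25_scores(query_tokens, docs_tokens, k1=1.2, b=0.75):
    """Score every document against the query with a textbook BM25 variant."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs_tokens for term in set(d))
    scores = []
    for tokens in docs_tokens:
        tf = Counter(tokens)
        score = 0.0
        # Each distinct query term counts once, even in a long query document.
        for term in set(query_tokens):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def retrieve(query, corpus, top_k=2, reranker=None, use_date_filter=True):
    """Pre-fetch with BM25, optionally date-filter, optionally re-rank."""
    if not corpus:
        return []
    scores = bm25_scores(tokenize(query.text), [tokenize(d.text) for d in corpus])
    candidates = list(zip(corpus, scores))
    if use_date_filter:
        # Assumed filter direction: keep documents no older than the query.
        candidates = [(d, s) for d, s in candidates if d.published >= query.published]
    ranked = sorted(candidates, key=lambda pair: pair[1], reverse=True)[:top_k]
    if reranker is not None:
        # Hypothetical hook: reranker(query, doc, prefetch_score) -> new score.
        rescored = [(d, reranker(query, d, s)) for d, s in ranked]
        ranked = sorted(rescored, key=lambda pair: pair[1], reverse=True)
    return [d.doc_id for d, _ in ranked]


# Toy data, invented for illustration only.
query = Doc("eu1", "directive on waste electrical equipment recycling", date(2003, 1, 27))
corpus = [
    Doc("uk1", "waste electrical equipment recycling regulations", date(2006, 12, 11)),
    Doc("uk2", "road traffic speed limits order", date(2007, 3, 1)),
    Doc("uk3", "waste electrical equipment recycling rules", date(1999, 5, 4)),
]
```

Calling `retrieve(query, corpus)` ranks `uk1` first and drops `uk3`, which is textually similar to the query but predates it. Passing `use_date_filter=False` lets `uk3` back into the top results, which illustrates why text similarity alone can mislead in this setting.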