The General Data Protection Regulation (GDPR) has become a standard for data protection law in many countries. Currently, twelve countries have adopted the regulation and established their own GDPR-like regulations. However, evaluating the differences and similarities among these GDPR-like regulations is time-consuming and requires substantial manual effort from legal experts. Moreover, GDPR-like regulations from different countries are written in their respective languages, which makes the task harder because it requires legal experts who know both languages of the regulations being compared. In this paper, we investigate a simple natural language processing (NLP) approach to tackle the problem. We first extract chunks of information from GDPR-like documents and turn the natural-language text into structured data. Next, we use NLP methods to compare the documents and measure their similarity. Finally, we manually label a small set of data to evaluate our approach. The empirical results show that the BERT model with cosine similarity outperforms the other baselines. Our data and code are publicly available.
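The comparison step can be illustrated with a minimal sketch. This is not the paper's implementation: `embed` is a hypothetical bag-of-words stand-in for a BERT encoder so the example runs with the standard library alone, `best_match` is an illustrative helper, and the sample chunks are invented. Only the cosine-similarity scoring mirrors the approach described above.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for a sentence encoder (assumption): bag-of-words counts.

    In the setting described above, this would be replaced by BERT embeddings.
    """
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dict-likes."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty chunk is treated as similar to nothing
    return dot / (norm_u * norm_v)


def best_match(chunk, candidates):
    """Return (index, score) of the candidate chunk most similar to `chunk`."""
    query = embed(chunk)
    scores = [cosine_similarity(query, embed(c)) for c in candidates]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


if __name__ == "__main__":
    # Invented example chunks, for illustration only.
    source_chunk = "right to erasure of personal data"
    other_regulation = [
        "data breach notification duty",
        "the right to erasure of personal data",
    ]
    index, score = best_match(source_chunk, other_regulation)
    print(index, round(score, 3))
```

Swapping `embed` for a dense encoder would only require `cosine_similarity` to accept dense vectors; the matching logic in `best_match` stays the same.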