This paper introduces the Saudi Privacy Policy Dataset, a diverse compilation of Arabic privacy policies from various sectors in Saudi Arabia, annotated according to the 10 principles of the Personal Data Protection Law (PDPL). The PDPL was designed to be compatible with the General Data Protection Regulation (GDPR), one of the most comprehensive data regulations worldwide. Data were collected from multiple sources, including the Saudi Central Bank, the Saudi Arabia National United Platform, the Council of Health Insurance, and general websites found through Google and Wikipedia. The final dataset covers 1,000 websites across 7 sectors and comprises 4,638 lines of text and 775,370 tokens, for a corpus size of 8,353 KB. The annotated dataset offers significant reuse potential for assessing privacy policy compliance, benchmarking privacy practices across industries, and developing automated tools for monitoring adherence to data protection regulations. By providing a comprehensive, annotated dataset of privacy policies, this paper aims to facilitate further research and development in privacy policy analysis, natural language processing, and machine learning applications related to privacy and data protection. It also serves as a resource for researchers, policymakers, and industry professionals interested in understanding and promoting compliance with privacy regulations in Saudi Arabia.