Privacy protection has drawn great attention in both legislation and user awareness. To protect user privacy, countries have enacted laws and regulations requiring software to provide privacy policies that regulate its behavior. However, privacy policies are written in natural language and contain many legal terms and software jargon, which keeps users from understanding or even reading them. It is therefore desirable to use NLP techniques to analyze privacy policies and help users understand them. Furthermore, existing datasets ignore legal requirements and are limited to English. In this paper, we construct the first Chinese privacy policy dataset, namely CA4P-483, to facilitate sequence labeling tasks and the identification of regulation compliance between privacy policies and software. Our dataset includes 483 Chinese Android application privacy policies, over 11K sentences, and 52K fine-grained annotations. We evaluate families of robust and representative baseline models on our dataset. Based on the baseline performance, we report findings and potential research directions for our dataset. Finally, we investigate potential applications of CA4P-483 that combine regulation requirements and program analysis.