Understanding privacy policies is crucial for users as it empowers them to learn about the information that matters to them. Sentences written in a privacy policy document explain privacy practices, and the constituent text spans convey further specific information about each practice. We refer to predicting the privacy practice explained in a sentence as intent classification and to identifying the text spans that share specific information as slot filling. In this work, we propose PolicyIE, an English corpus consisting of 5,250 intent and 11,788 slot annotations spanning 31 privacy policies of websites and mobile applications. The PolicyIE corpus is a challenging real-world benchmark with limited labeled examples, reflecting the cost of collecting large-scale annotations from domain experts. We present two alternative neural approaches as baselines: (1) formulating intent classification and slot filling as a joint sequence tagging task and (2) modeling them as a sequence-to-sequence (Seq2Seq) learning task. The experimental results show that both approaches perform comparably in intent classification, while the Seq2Seq method outperforms the sequence tagging approach in slot filling by a large margin. We perform a detailed error analysis to reveal the challenges of the proposed corpus.
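To make the two formulations concrete, the following minimal sketch shows how a single annotated sentence might be encoded for each baseline. It is an illustration under stated assumptions, not the corpus's actual format: the example sentence, the intent name `data-collection`, the slot names (`collector`, `data`, `purpose`), the BIO tagging scheme, and the bracketed `[IN:... [SL:... ]]` linearization are all hypothetical choices made here.

```python
from typing import List, Tuple

# A slot span: (start token index, end token index exclusive, slot label).
Span = Tuple[int, int, str]


def to_bio_tags(tokens: List[str], spans: List[Span]) -> List[str]:
    """Encode slot spans as one BIO tag per token (sequence tagging view)."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


def to_seq2seq_target(tokens: List[str], intent: str, spans: List[Span]) -> str:
    """Linearize intent and slots into one target string (Seq2Seq view).

    The bracketed format is an assumption made for this sketch.
    """
    parts = [f"[IN:{intent}"]
    for start, end, label in sorted(spans):
        parts.append(f"[SL:{label} {' '.join(tokens[start:end])}]")
    return " ".join(parts) + "]"


# Hypothetical example sentence and annotations (not taken from the corpus).
tokens = "We collect your email address to send newsletters".split()
intent = "data-collection"
spans = [(0, 1, "collector"), (2, 5, "data"), (5, 8, "purpose")]

bio_tags = to_bio_tags(tokens, spans)
target = to_seq2seq_target(tokens, intent, spans)

for token, tag in zip(tokens, bio_tags):
    print(f"{token}\t{tag}")
print(target)
```

In the tagging view the model predicts one label per token plus a sentence-level intent, while in the Seq2Seq view it generates the whole structured string, so intent and slots are decoded together.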