Prior studies on privacy policies frame the question answering (QA) task as identifying the most relevant text segment, or a list of sentences, in the policy document for a user query. However, annotating such a dataset is challenging because it requires specific domain expertise (e.g., legal scholars). Even if a small-scale dataset can be built, a bottleneck remains: the labeled data are heavily imbalanced (only a few segments are relevant), limiting the gains achievable in this domain. Therefore, in this paper, we develop a novel data augmentation framework based on ensembling retriever models that captures relevant text segments from unlabeled policy documents and expands the positive examples in the training set. In addition, to improve the diversity and quality of the augmented data, we leverage multiple pre-trained language models (LMs) and cascade them with noise-reduction oracles. Using our augmented data on the PrivacyQA benchmark, we outperform the existing baseline by a large margin (10% F1) and achieve a new state-of-the-art F1 score of 50%. Our ablation studies provide further insights into the effectiveness of our approach.
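The augmentation pipeline described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: several retrievers score unlabeled policy segments against a query, their scores are ensembled, and a noise-reduction "oracle" filters low-confidence candidates before they are added as positive training examples. All function names, toy retrievers, and thresholds here are assumptions for illustration.

```python
from typing import Callable, List, Tuple

# A retriever maps (query, segment) to a relevance score in [0, 1].
Retriever = Callable[[str, str], float]


def ensemble_retrieve(
    query: str,
    segments: List[str],
    retrievers: List[Retriever],
    vote_threshold: float = 0.5,
) -> List[Tuple[str, float]]:
    """Average the retrievers' scores and keep segments above the threshold."""
    candidates = []
    for seg in segments:
        score = sum(r(query, seg) for r in retrievers) / len(retrievers)
        if score >= vote_threshold:
            candidates.append((seg, score))
    return candidates


def noise_reduction_oracle(
    candidates: List[Tuple[str, float]], min_score: float = 0.8
) -> List[str]:
    """Cascaded filter: drop augmented positives the ensemble is unsure about."""
    return [seg for seg, score in candidates if score >= min_score]


# Toy retrievers standing in for pre-trained LM-based scorers (assumptions).
def word_overlap(q: str, s: str) -> float:
    qw = set(q.split())
    return len(qw & set(s.split())) / max(len(qw), 1)


def length_prior(q: str, s: str) -> float:
    return 1.0 if len(s.split()) > 3 else 0.0


segments = [
    "we share your location data with advertising partners",
    "contact us",
    "cookies are used to improve the service",
]
cands = ensemble_retrieve(
    "do you share my location data", segments, [word_overlap, length_prior]
)
positives = noise_reduction_oracle(cands, min_score=0.6)
```

Here the ensemble admits two candidate segments, and the stricter noise-reduction stage keeps only the one segment both toy retrievers agree on, mirroring how the cascade trades recall for cleaner augmented positives.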