Prior studies on privacy policies frame the question answering (QA) task as identifying the most relevant text segment, or a list of sentences, in the policy document for a user query. However, annotating such a dataset is challenging because it requires specific domain expertise (e.g., legal scholars). Even with a small-scale dataset, a remaining bottleneck is that the labeled data are heavily imbalanced (only a few segments are relevant), limiting the gains achievable in this domain. Therefore, in this paper, we develop a novel data augmentation framework based on ensembling retriever models that capture relevant text segments from unlabeled policy documents and expand the positive examples in the training set. In addition, to improve the diversity and quality of the augmented data, we leverage multiple pre-trained language models (LMs) and cascade them with noise-reduction oracles. Using our augmented data on the PrivacyQA benchmark, we improve over the existing baseline by a large margin (10\% F1) and achieve a new state-of-the-art F1 score of 50\%. Our ablation studies provide further insights into the effectiveness of our approach.
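The core idea above (an ensemble of retrievers mines candidate positive segments from unlabeled documents, and a noise-reduction step keeps only high-confidence ones) can be illustrated with a minimal sketch. All function names, the two toy retrievers, and the majority-vote filter here are hypothetical stand-ins for the paper's actual LM-based retrievers and oracles, shown only to make the pipeline concrete.

```python
# Hypothetical sketch of retriever-ensemble data augmentation.
# Two toy retrievers score each unlabeled segment for a query; a segment
# is kept as an augmented positive only if enough retrievers agree
# (a crude stand-in for the cascaded noise-reduction oracles).

def token_overlap_score(query, segment):
    """Toy retriever 1: fraction of query tokens appearing in the segment."""
    q = set(query.lower().split())
    s = set(segment.lower().split())
    return len(q & s) / max(len(q), 1)

def char_ngram_score(query, segment, n=3):
    """Toy retriever 2: character n-gram overlap with the query."""
    grams = lambda t: {t[i:i + n] for i in range(max(len(t) - n + 1, 1))}
    q, s = grams(query.lower()), grams(segment.lower())
    return len(q & s) / max(len(q), 1)

def augment_positives(query, segments, retrievers, threshold=0.5, min_votes=2):
    """Keep segments that at least `min_votes` retrievers score >= `threshold`."""
    kept = []
    for seg in segments:
        votes = sum(r(query, seg) >= threshold for r in retrievers)
        if votes >= min_votes:
            kept.append(seg)
    return kept

# Illustrative unlabeled policy segments and a user query.
segments = [
    "We share your location data with advertising partners.",
    "This policy was last updated in March.",
    "Your location may be collected when you use the app.",
]
query = "do you share my location data"
positives = augment_positives(
    query, segments, [token_overlap_score, char_ngram_score]
)
# Only the segment both retrievers agree on becomes an augmented positive.
```

In the paper's framework, the toy scorers would be replaced by multiple pre-trained LM retrievers, and the agreement filter by the noise-reduction oracles; the requirement that several retrievers agree is what keeps the augmented positives diverse yet clean.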