Pre-training large transformer models with in-domain data improves domain adaptation and helps gain performance on domain-specific downstream tasks. However, sharing models pre-trained on potentially sensitive data is prone to adversarial privacy attacks. In this paper, we ask to what extent we can guarantee the privacy of pre-training data and, at the same time, achieve better downstream performance on legal tasks without the need for additional labeled data. We extensively experiment with scalable self-supervised learning of transformer models under the formal paradigm of differential privacy and show that under specific training configurations we can improve downstream performance without sacrificing privacy protection for the in-domain data. Our main contribution is utilizing differential privacy for large-scale pre-training of transformer language models in the legal NLP domain, which, to the best of our knowledge, has not been addressed before.
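To make the training paradigm concrete, below is a minimal sketch of differentially private pre-training with DP-SGD, assuming the Opacus library. The toy model, masking scheme, dataset, and all hyperparameters (noise multiplier, clipping norm, delta) are illustrative assumptions for exposition, not the configuration used in this work.

```python
# Minimal DP-SGD pre-training sketch, assuming the Opacus library
# (https://opacus.ai). Model, masking, and hyperparameters are
# illustrative placeholders, not the paper's actual setup.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from opacus import PrivacyEngine

VOCAB, DIM, MASK_ID = 1000, 64, 0  # toy vocabulary; real runs use a BERT-scale encoder

# A tiny masked-language-model stand-in; nn.Embedding and nn.Linear
# are supported by Opacus's per-sample gradient hooks out of the box.
model = nn.Sequential(
    nn.Embedding(VOCAB, DIM),
    nn.Linear(DIM, DIM),
    nn.ReLU(),
    nn.Linear(DIM, VOCAB),
)

# Synthetic token sequences; in practice these are in-domain (legal) documents.
tokens = torch.randint(1, VOCAB, (256, 32))
loader = DataLoader(TensorDataset(tokens), batch_size=32)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# DP-SGD = per-sample gradient clipping + calibrated Gaussian noise.
privacy_engine = PrivacyEngine()
model, optimizer, loader = privacy_engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=loader,
    noise_multiplier=1.0,  # assumed value; governs the privacy/utility trade-off
    max_grad_norm=1.0,     # per-sample gradient clipping bound
)

loss_fn = nn.CrossEntropyLoss()
for (batch,) in loader:
    if batch.numel() == 0:        # Poisson sampling can yield empty batches
        continue
    inputs = batch.clone()
    inputs[:, ::4] = MASK_ID      # crude masking of every 4th token position
    logits = model(inputs)
    loss = loss_fn(logits.view(-1, VOCAB), batch.view(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Report the privacy budget spent for a chosen delta.
print(f"epsilon = {privacy_engine.get_epsilon(delta=1e-5):.2f}")
```

The key design point is that privacy accounting happens per training step: each per-sample gradient is clipped and noised before aggregation, so the reported epsilon bounds what any single pre-training document can leak into the shared model.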