The performance of differentially private machine learning can be boosted significantly by leveraging the transfer learning capabilities of non-private models pretrained on large public datasets. We critically review this approach. We primarily question whether the use of large Web-scraped datasets should be viewed as differential-privacy-preserving. We caution that publicizing these models pretrained on Web data as "private" could lead to harm and erode the public's trust in differential privacy as a meaningful definition of privacy. Beyond the privacy considerations of using public data, we further question the utility of this paradigm. We scrutinize whether existing machine learning benchmarks are appropriate for measuring the ability of pretrained models to generalize to sensitive domains, which may be poorly represented in public Web data. Finally, we note that pretraining has been especially impactful for the largest available models -- models sufficiently large to prohibit end users from running them on their own devices. Thus, deploying such models today could be a net loss for privacy, as it would require (private) data to be outsourced to a more compute-powerful third party. We conclude by discussing potential paths forward for the field of private learning, as public pretraining becomes more popular and powerful.
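As a point of reference for the privacy claims scrutinized above, the standard $(\varepsilon, \delta)$-differential privacy definition (well known, though not restated in the abstract) can be written as follows: a randomized mechanism $M$ is $(\varepsilon, \delta)$-differentially private if, for all pairs of datasets $D, D'$ differing in a single record and all sets of outcomes $S$,
\[
\Pr[M(D) \in S] \;\le\; e^{\varepsilon} \, \Pr[M(D') \in S] + \delta .
\]
The critique above hinges on the observation that, when pretraining data is declared "public," it is treated as lying outside $D$, so this formal guarantee says nothing about individuals whose information appears in the Web-scraped pretraining corpus.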