Large machine learning models, or so-called foundation models, aim to serve as base models for application-oriented machine learning. Although these models showcase impressive performance, they have been empirically found to pose serious security and privacy issues. One may wonder, however, whether this is a limitation of the current models, or whether these issues stem from a fundamental intrinsic impossibility of the foundation model learning problem itself. This paper aims to systematize our knowledge supporting the latter. More precisely, we identify several key features of today's foundation model learning problem which, given the current understanding in adversarial machine learning, suggest that high accuracy is incompatible with both security and privacy. We begin by observing that high accuracy seems to require (1) very high-dimensional models and (2) huge amounts of data that can only be procured through user-generated datasets. Moreover, such data is fundamentally heterogeneous, as users generally have very specific (easily identifiable) data-generating habits. More importantly, users' data is filled with highly sensitive information and may be heavily polluted by fake users. We then survey lower bounds on accuracy in privacy-preserving and Byzantine-resilient heterogeneous learning which, we argue, constitute a compelling case against the possibility of designing a secure and privacy-preserving high-accuracy foundation model. We further stress that our analysis also applies to other high-stakes machine learning applications, including content recommendation. We conclude by calling for measures to prioritize security and privacy, and to slow down the race for ever-larger models.
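To give a concrete sense of the flavor of the lower bounds we survey, recall a classical result from the differential privacy literature (Bassily, Smith and Thakurta, 2014), stated here informally and under standard assumptions (convex, 1-Lipschitz losses over a bounded constraint set); the notation below is ours, not that of the original paper. Any $(\varepsilon, \delta)$-differentially private learner trained on $n$ samples in dimension $d$ must, up to logarithmic factors, incur excess empirical risk
\[
  \mathbb{E}\big[\mathcal{L}(\hat{\theta}_{\mathrm{priv}})\big]
  \;-\; \min_{\theta \in \mathcal{C}} \mathcal{L}(\theta)
  \;=\; \Omega\!\left( \min\left\{ 1,\; \frac{\sqrt{d}}{n \varepsilon} \right\} \right),
\]
where $\mathcal{L}$ denotes the empirical risk, $\mathcal{C}$ the constraint set, and $\hat{\theta}_{\mathrm{priv}}$ the output of the private algorithm. The privacy cost thus grows with the model dimension $d$: for very high-dimensional models, the number of samples $n$ needed to retain both privacy and accuracy becomes prohibitive, which is precisely the dimension dependence our argument builds on.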