Large machine learning models, or so-called foundation models, aim to serve as base models for application-oriented machine learning. Although these models showcase impressive performance, they have been empirically found to pose serious security and privacy issues. One may wonder, however, whether this is a limitation of the current models, or whether these issues stem from a fundamental intrinsic impossibility of the foundation model learning problem itself. This paper aims to systematize our knowledge supporting the latter. More precisely, we identify several key features of today's foundation model learning problem which, given the current understanding in adversarial machine learning, suggest that high accuracy is incompatible with both security and privacy. We begin by observing that high accuracy seems to require (1) very high-dimensional models and (2) huge amounts of data that can only be procured through user-generated datasets. Moreover, such data is fundamentally heterogeneous, as users generally have very specific (easily identifiable) data-generating habits. More importantly, users' data is filled with highly sensitive information and may be heavily polluted by fake users. We then survey lower bounds on accuracy in privacy-preserving and Byzantine-resilient heterogeneous learning which, we argue, constitute a compelling case against the possibility of designing a secure and privacy-preserving high-accuracy foundation model. We further stress that our analysis also applies to other high-stakes machine learning applications, including content recommendation. We conclude by calling for measures to prioritize security and privacy, and to slow down the race for ever-larger models.
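To give a concrete sense of the flavor of the lower bounds we survey, recall a classical result from the differential privacy literature (Bassily, Smith and Thakurta, 2014), stated here informally and under standard assumptions (convex, 1-Lipschitz losses over a bounded constraint set); the notation below is ours, not that of the original paper. Any $(\varepsilon, \delta)$-differentially private learner trained on $n$ samples in dimension $d$ must, up to logarithmic factors, incur excess empirical risk
\[
  \mathbb{E}\big[\mathcal{L}(\hat{\theta}_{\mathrm{priv}})\big]
  \;-\; \min_{\theta \in \mathcal{C}} \mathcal{L}(\theta)
  \;=\; \Omega\!\left( \min\left\{ 1,\; \frac{\sqrt{d}}{n \varepsilon} \right\} \right),
\]
where $\mathcal{L}$ denotes the empirical risk, $\mathcal{C}$ the constraint set, and $\hat{\theta}_{\mathrm{priv}}$ the output of the private algorithm. The privacy cost thus grows with the model dimension $d$: for very high-dimensional models, the number of samples $n$ needed to retain both privacy and accuracy becomes prohibitive, which is precisely the dimension dependence our argument builds on.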