Leakage of data from publicly available Machine Learning (ML) models is an area of growing significance, as commercial and government applications of ML can draw on multiple sources of data, potentially including users' and clients' sensitive data. We provide a comprehensive survey of contemporary advances on several fronts, covering involuntary data leakage that is inherent to ML models, potential malevolent leakage caused by privacy attacks, and currently available defence mechanisms. We focus on inference-time leakage, the most likely scenario for publicly available models. We first discuss what leakage means in the context of different data, tasks, and model architectures. We then propose a taxonomy spanning involuntary and malevolent leakage and the available defences, followed by currently available assessment metrics and applications. We conclude with outstanding challenges and open questions, outlining some promising directions for future research.