State-of-the-art membership inference attacks (MIAs) typically require training many reference models, making it difficult to scale these attacks to large pre-trained language models (LLMs). As a result, prior research has relied either on weaker attacks that avoid training reference models (e.g., fine-tuning attacks) or on stronger attacks applied to small models and datasets. However, weaker attacks have been shown to be brittle, and insights from strong attacks in simplified settings do not translate to today's LLMs. These challenges prompt an important question: are the limitations observed in prior work due to attack design choices, or are MIAs fundamentally ineffective on LLMs? We address this question by scaling LiRA--one of the strongest MIAs--to GPT-2 architectures ranging from 10M to 1B parameters, training reference models on over 20B tokens from the C4 dataset. Our results advance the understanding of MIAs on LLMs in four key ways: while (1) strong MIAs can succeed on pre-trained LLMs, (2) their effectiveness remains limited (e.g., AUC < 0.7) in practical settings. (3) Even when strong MIAs achieve better-than-random AUC, aggregate metrics can conceal substantial instability in per-sample MIA decisions: due to training randomness, many decisions are so unstable that they are statistically indistinguishable from a coin flip. Finally, (4) the relationship between MIA success and related LLM privacy metrics is not as straightforward as prior work has suggested.
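For context, LiRA's per-sample test follows the standard likelihood-ratio formulation: fit Gaussians to a score collected from reference models trained with and without the candidate example, then compare the target model's score under the two distributions. The sketch below is a minimal illustration of that test, assuming the raw language-model loss is used as the score; the function name and example values are illustrative, not this paper's exact implementation.

```python
import numpy as np
from scipy.stats import norm

def lira_score(target_stat, in_stats, out_stats, eps=1e-8):
    """Per-sample LiRA score (online variant): log-likelihood ratio of the
    target model's statistic under Gaussians fit to reference models trained
    with ("in") vs. without ("out") the candidate example.

    target_stat : float -- e.g., the target model's loss on the candidate text (assumed statistic)
    in_stats    : array -- same statistic from reference models that saw the candidate
    out_stats   : array -- same statistic from reference models that did not
    """
    mu_in, sd_in = np.mean(in_stats), np.std(in_stats) + eps
    mu_out, sd_out = np.mean(out_stats), np.std(out_stats) + eps
    # Higher score => the observed statistic looks more like a member's.
    return norm.logpdf(target_stat, mu_in, sd_in) - norm.logpdf(target_stat, mu_out, sd_out)

# Illustrative usage: a membership decision at a threshold tau chosen by the attacker.
score = lira_score(1.3, in_stats=[1.2, 1.4, 1.5], out_stats=[2.1, 2.3, 2.0])
predict_member = score > 0.0  # tau = 0.0 here; in practice tau sets the false-positive rate
```

Sweeping the threshold tau over all candidate samples traces out the ROC curve from which the AUC figures discussed above are computed.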