Targeted training-set attacks inject malicious instances into the training set to cause a trained model to mislabel one or more specific test instances. This work proposes the task of target identification, which determines whether a specific test instance is the target of a training-set attack. Target identification can be combined with adversarial-instance identification to find (and remove) the attack instances, mitigating the attack with minimal impact on other predictions. Rather than focusing on a single attack method or data modality, we build on influence estimation, which quantifies each training instance's contribution to a model's prediction. We show that existing influence estimators' poor practical performance often derives from their over-reliance on training instances and iterations with large losses. Our renormalized influence estimators fix this weakness; they far outperform the original estimators at identifying influential groups of training examples in both adversarial and non-adversarial settings, even finding up to 100% of adversarial training instances with no clean-data false positives. Target identification then simplifies to detecting test instances with anomalous influence values. We demonstrate our method's effectiveness on backdoor and poisoning attacks across various data domains, including text, vision, and speech, as well as against a gray-box, adaptive attacker that specifically optimizes the adversarial instances to evade our method. Our source code is available at https://github.com/ZaydH/target_identification.
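To make the renormalization idea concrete, below is a minimal sketch of gradient-based influence with the magnitude bias removed, assuming a PyTorch model. Standard gradient-based influence scores a training instance by the dot product between its loss gradient and the test instance's loss gradient, which over-weights large-loss instances; replacing the dot product with cosine similarity is one way to renormalize, as the abstract describes. The function names (`renormalized_influence`, `flat_grad`) are illustrative and not taken from the paper's codebase, and this single-checkpoint sketch omits the summation over training iterations used by checkpoint-based estimators.

```python
import torch
import torch.nn.functional as F

def renormalized_influence(model, loss_fn, train_batch, test_instance):
    """Sketch: renormalized influence of each training instance on a test instance.

    train_batch: iterable of (x, y) pairs, each with a batch dimension.
    test_instance: a single (x, y) pair with a batch dimension.
    """
    params = [p for p in model.parameters() if p.requires_grad]

    def flat_grad(x, y):
        # Flattened gradient of the loss w.r.t. all trainable parameters.
        loss = loss_fn(model(x), y)
        grads = torch.autograd.grad(loss, params)
        return torch.cat([g.reshape(-1) for g in grads])

    x_test, y_test = test_instance
    g_test = flat_grad(x_test, y_test)

    scores = []
    for x_tr, y_tr in train_batch:
        g_tr = flat_grad(x_tr, y_tr)
        # Renormalization: cosine similarity discards gradient magnitude,
        # so large-loss (e.g., poisoned or mislabeled) training instances
        # are no longer over-weighted relative to clean ones.
        scores.append(F.cosine_similarity(g_tr, g_test, dim=0))
    return torch.stack(scores)
```

Under the paper's framing, target identification would then flag test instances whose resulting influence distribution is anomalous; that detection step is not shown here.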