Backdoor attacks, in which an adversary uses inputs stamped with a trigger (e.g., a patch) to activate pre-planted malicious behaviors, are a severe threat to Deep Neural Network (DNN) models. Trigger inversion is an effective way of identifying backdoored models and understanding their embedded adversarial behaviors. A key challenge of trigger inversion is that there are many ways to construct a trigger. Existing methods cannot generalize to various trigger types because they make specific assumptions or impose attack-specific constraints. The fundamental reason is that existing work does not consider the trigger's design space when formulating the inversion problem. This work formally defines and analyzes triggers injected in different spaces and the corresponding inversion problem. It then proposes a unified framework to invert backdoor triggers, based on this formalization and on the inner behaviors of backdoored models identified in our analysis. Our prototype UNICORN is general and effective in inverting backdoor triggers in DNNs. The code can be found at https://github.com/RU-System-Software-and-Security/UNICORN.
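To make the inversion problem concrete: the classic pixel-space formulation (in the spirit of Neural Cleanse-style optimization, not the UNICORN framework itself) searches for a mask and a pattern that, when blended into clean inputs, flip the model's prediction to a target class, while an L1 penalty keeps the mask small. The following is a minimal NumPy sketch on a toy linear classifier with a hand-planted backdoor; all names and numbers here are illustrative assumptions.

```python
import numpy as np

# Toy "backdoored" linear classifier: the weight matrix is rigged so that
# a large value in input feature 0 flips the prediction to target class 2.
rng = np.random.default_rng(0)
n_feat, n_cls, target = 4, 3, 2
W = rng.normal(scale=0.3, size=(n_cls, n_feat))
W[target, 0] = 4.0                    # planted backdoor: feature 0 -> class 2
X = rng.normal(size=(8, n_feat))      # a few clean inputs

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def loss_and_grads(m, p, lam=0.05):
    """Cross-entropy toward the target class plus an L1 penalty on the mask."""
    Xt = (1 - m) * X + m * p          # blend trigger pattern p via mask m
    probs = softmax(Xt @ W.T)
    onehot = np.eye(n_cls)[target]
    ce = -np.log(probs[:, target] + 1e-12).mean()
    g = (probs - onehot) @ W / len(X)             # dL/dXt, per sample
    dm = (g * (p - X)).sum(axis=0) + lam * np.sign(m)
    dp = (g * m).sum(axis=0)
    return ce + lam * np.abs(m).sum(), dm, dp

m = np.full(n_feat, 0.5)              # mask init: trigger may touch any feature
p = np.zeros(n_feat)                  # pattern init
loss0, _, _ = loss_and_grads(m, p)
for _ in range(300):                  # plain projected gradient descent
    loss, dm, dp = loss_and_grads(m, p)
    m = np.clip(m - 0.05 * dm, 0.0, 1.0)   # keep mask in [0, 1]
    p = p - 0.05 * dp
loss_final, _, _ = loss_and_grads(m, p)
```

A small recovered mask whose attack loss is low is evidence of a backdoor. The abstract's point is that this single formulation assumes a pixel-space additive trigger; triggers injected in other spaces (e.g., feature or transformation spaces) need the more general formulation the paper develops.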