Trojan attacks raise serious security concerns. In this paper, we investigate the underlying mechanism of Trojaned BERT models. We observe an attention focus drifting behavior in Trojaned models: when the model encounters a poisoned input, the trigger token hijacks the attention focus regardless of the context. We provide a thorough qualitative and quantitative analysis of this phenomenon, revealing insights into the Trojan mechanism. Based on this observation, we propose an attention-based Trojan detector that distinguishes Trojaned models from clean ones. To the best of our knowledge, this is the first paper to analyze the Trojan mechanism and to develop a Trojan detector based on the transformer's attention.
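
To make the drifting behavior concrete, the following is a minimal sketch of how one might check whether a single attention head's focus lands on the trigger token. It is an illustration under stated assumptions, not the detector proposed in the paper: the function names, the argmax-based notion of "focus", and the 0.9 threshold are all hypothetical choices introduced here.

```python
from typing import List

# attention[i][j] is the weight that query token i places on key token j
# for one attention head; each row is assumed to sum to 1.
Matrix = List[List[float]]


def focus_token(attention: Matrix, query: int) -> int:
    """Index of the key token that receives the largest weight from `query`."""
    row = attention[query]
    return max(range(len(row)), key=row.__getitem__)


def trigger_focus_ratio(attention: Matrix, trigger_pos: int) -> float:
    """Fraction of non-trigger query tokens whose focus is the trigger token."""
    queries = [i for i in range(len(attention)) if i != trigger_pos]
    if not queries:
        return 0.0
    hits = sum(1 for i in queries if focus_token(attention, i) == trigger_pos)
    return hits / len(queries)


def is_drifting_head(attention: Matrix, trigger_pos: int,
                     threshold: float = 0.9) -> bool:
    """Flag a head as drifting; the 0.9 threshold is an assumed value."""
    return trigger_focus_ratio(attention, trigger_pos) >= threshold


# Toy 3-token examples with the (hypothetical) trigger at position 1.
clean_head = [
    [0.7, 0.2, 0.1],
    [0.2, 0.6, 0.2],
    [0.1, 0.2, 0.7],
]
hijacked_head = [
    [0.1, 0.8, 0.1],
    [0.2, 0.6, 0.2],
    [0.1, 0.7, 0.2],
]

clean_ratio = trigger_focus_ratio(clean_head, trigger_pos=1)
hijacked_ratio = trigger_focus_ratio(hijacked_head, trigger_pos=1)
```

In the toy clean head each token attends mostly to itself, while in the hijacked head every other token places its largest weight on the trigger position. A real analysis would apply such a measure to attention matrices extracted from a BERT model, across heads and across varied contexts.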