Neural networks are powered by an implicit bias: a tendency of gradient descent to fit training data in a way that generalizes to unseen data. A recent class of neural network models gaining increasing popularity is structured state space models (SSMs), regarded as an efficient alternative to transformers. Prior work argued that the implicit bias of SSMs leads to generalization in a setting where data is generated by a low-dimensional teacher. In this paper, we revisit the latter setting, and formally establish a phenomenon entirely undetected by prior work on the implicit bias of SSMs. Namely, we prove that while implicit bias leads to generalization under many choices of training data, there exist special examples whose inclusion in training completely distorts the implicit bias, to a point where generalization fails. This failure occurs despite the special training examples being labeled by the teacher, i.e., having clean labels! We empirically demonstrate the phenomenon, with SSMs trained independently and as part of non-linear neural networks. In the area of adversarial machine learning, disrupting generalization with cleanly labeled training examples is known as clean-label poisoning. Given the proliferation of SSMs, we believe that delineating their susceptibility to clean-label poisoning, and developing methods for overcoming this susceptibility, are critical research directions to pursue.
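To make the teacher-student setting referenced above concrete, the following is a minimal illustrative sketch (not the paper's construction, and without the special poisoning examples): a low-dimensional linear "teacher" SSM provides clean labels for input sequences, and an over-parameterized diagonal "student" SSM is fit to those labels with gradient descent. All dimensions, initializations, and hyperparameters here are assumptions chosen for illustration.

```python
# Illustrative teacher-student setup with linear SSMs (sketch only; the paper's
# specific poisoning construction is not reproduced here).
import torch

torch.manual_seed(0)
T = 16          # sequence length
d_teacher = 1   # low-dimensional teacher state (assumption)
d_student = 8   # over-parameterized student state (assumption)
n_train = 32    # number of training sequences (assumption)

def ssm_kernel(A, B, C, T):
    # A linear SSM mapping a length-T input sequence to a scalar output acts as
    # an inner product with the kernel k_t = C A^(T-1-t) B.
    k = [C @ torch.matrix_power(A, T - 1 - t) @ B for t in range(T)]
    return torch.stack(k).squeeze()

# Teacher: fixed low-dimensional linear SSM that provides clean labels.
A_t = 0.9 * torch.eye(d_teacher)
B_t = torch.ones(d_teacher, 1)
C_t = torch.ones(1, d_teacher)
teacher_kernel = ssm_kernel(A_t, B_t, C_t, T)

# Training data: random input sequences labeled by the teacher (clean labels).
U = torch.randn(n_train, T)
y = U @ teacher_kernel

# Student: diagonal linear SSM with parameters (a, b, c), trained by gradient descent.
a = torch.nn.Parameter(0.5 + 0.1 * torch.randn(d_student))
b = torch.nn.Parameter(0.1 * torch.randn(d_student))
c = torch.nn.Parameter(0.1 * torch.randn(d_student))
opt = torch.optim.SGD([a, b, c], lr=1e-2)

for step in range(5000):
    # Diagonal SSM kernel: k_t = sum_i c_i * a_i^(T-1-t) * b_i
    powers = torch.stack([a ** (T - 1 - t) for t in range(T)])  # shape (T, d_student)
    student_kernel = powers @ (b * c)
    loss = ((U @ student_kernel - y) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Generalization probe: fresh inputs labeled by the same teacher.
with torch.no_grad():
    U_test = torch.randn(256, T)
    powers = torch.stack([a ** (T - 1 - t) for t in range(T)])
    test_err = ((U_test @ (powers @ (b * c)) - U_test @ teacher_kernel) ** 2).mean()
print(f"train loss: {loss.item():.3e}, test error: {test_err.item():.3e}")
```

Under the paper's thesis, adding a handful of specially constructed sequences to `U` (still labeled by `teacher_kernel`, i.e., with clean labels) would drive the test error up even though the training loss remains small; the sketch above only sets up the benign version of the experiment.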