In this paper, we study the response of large models from the BERT family to incoherent inputs that should confuse any model that claims to understand natural language. We define simple heuristics to construct such examples. Our experiments show that state-of-the-art models consistently fail to recognize them as ill-formed, and instead produce high-confidence predictions on them. As a consequence of this phenomenon, models trained on sentences with randomly permuted word order perform close to state-of-the-art models. To alleviate these issues, we show that if models are explicitly trained to recognize invalid inputs, they can be robust to such attacks without a drop in performance.
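To make the word-order permutation concrete, the sketch below shows one way such incoherent inputs might be constructed. It is a minimal illustration, not the paper's exact procedure: the function name `permute_word_order`, the whitespace tokenization, and the fixed seed are assumptions for reproducibility of the example.

```python
import random

def permute_word_order(sentence: str, rng: random.Random) -> str:
    """Shuffle the words of a sentence to produce an ill-formed 'word salad'.

    Hypothetical instance of a simple construction heuristic: tokenize on
    whitespace and randomly permute the resulting words.
    """
    words = sentence.split()
    rng.shuffle(words)
    return " ".join(words)

# A well-formed sentence becomes an incoherent input that a robust model
# should flag as invalid rather than classify with high confidence.
rng = random.Random(0)
print(permute_word_order("the cat sat on the mat", rng))
```

In a setup like this, permuted sentences could be added to training data with an explicit "invalid input" label, which is the kind of explicit training the abstract describes as restoring robustness without hurting performance.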