The self-attention revolution allowed generative language models to scale and achieve increasingly impressive abilities. Such models, commonly referred to as Large Language Models (LLMs), have recently gained prominence with the general public thanks to conversational fine-tuning, which brought their behavior in line with public expectations of AI. This prominence amplified prior concerns about the misuse of LLMs and led to the emergence of numerous tools for detecting LLMs in the wild. Unfortunately, most such tools are critically flawed. While major publications in the LLM-detectability field suggested that LLMs were easy to detect with fine-tuned autoencoders, the limitations of their results are easy to overlook. Specifically, they assumed publicly available generative models used without fine-tuning or non-trivial prompting. While the importance of these assumptions has been demonstrated, it remained unclear until now how well such detection could be countered. Here, we show that an attacker with access to a detector's reference human texts and outputs can not only evade detection but fully frustrate detector training - on a reasonable budget and even with all of its generated texts labeled as such. Achieving this required combining a common "reinforcement from critic" loss-function modification with the AdamW optimizer, a combination that led to surprisingly good fine-tuning generalization. Finally, we warn against the temptation to transpose conclusions obtained on RNN-driven text GANs to LLMs, given the latter's superior representational ability. These results have critical implications for the detection and prevention of malicious use of generative language models, and we hope they will aid the designers of both generative models and detectors.
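The abstract only names the ingredients of the attack, a "reinforcement from critic" loss modification combined with the AdamW optimizer, without specifying them. The sketch below is a hypothetical illustration of how such a policy-gradient-style objective could use a detector as the critic during generator fine-tuning; `generator`, `critic`, and `critic_reinforce_loss` are illustrative names and signatures, not the paper's implementation.

```python
# Minimal sketch (not the authors' code) of a "reinforcement from critic" loss
# paired with AdamW. Assumptions: the generator exposes per-token log-probs of
# its sampled continuations, and the critic (a detector) returns P(machine).
import torch
from torch.optim import AdamW


def critic_reinforce_loss(token_logprobs: torch.Tensor,
                          detector_prob_machine: torch.Tensor) -> torch.Tensor:
    """REINFORCE-style surrogate: reward samples the detector scores as human.

    token_logprobs: (batch, seq_len) log-probabilities of the sampled tokens.
    detector_prob_machine: (batch,) detector's P(machine-generated) per sample.
    """
    # Reward is high when the detector believes the text is human-written.
    reward = 1.0 - detector_prob_machine          # (batch,)
    # Sequence log-probability under the generator's policy.
    seq_logprob = token_logprobs.sum(dim=-1)      # (batch,)
    # Policy-gradient surrogate: -E[reward * log pi(sample)]; the reward is
    # treated as a constant so gradients flow only through the generator.
    return -(reward.detach() * seq_logprob).mean()


# Usage sketch, assuming hypothetical `generator.sample(prompts)` returning
# (samples, token_logprobs) and `critic(samples)` returning P(machine):
#
# optimizer = AdamW(generator.parameters(), lr=1e-5, weight_decay=0.01)
# samples, token_logprobs = generator.sample(prompts)
# loss = critic_reinforce_loss(token_logprobs, critic(samples))
# loss.backward()
# optimizer.step()
# optimizer.zero_grad()
```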