LLMs can achieve substantial zero-shot performance on diverse tasks from a simple task prompt, eliminating the need for training or fine-tuning. However, when applying these models to sensitive tasks, it is crucial to thoroughly assess their robustness against adversarial inputs. In this work, we introduce Static Deceptor (StaDec) and Dynamic Deceptor (DyDec), two novel attack frameworks that systematically generate dynamic and adaptive adversarial examples by leveraging an LLM's own language understanding. We produce subtle, natural-looking adversarial inputs that preserve semantic similarity to the original text while effectively deceiving the target LLM. By using an automated, LLM-driven pipeline, we eliminate the dependence on external heuristics. Our attacks evolve with advances in LLMs and demonstrate strong transferability to models unknown to the attacker. Overall, this work provides a systematic approach for the self-assessment of an LLM's robustness. We release our code and data at https://github.com/Shukti042/AdversarialExample.