Glitch tokens, inputs that trigger unpredictable or anomalous behavior in Large Language Models (LLMs), pose significant challenges to model reliability and safety. Existing detection methods primarily rely on heuristic embedding patterns or statistical anomalies within internal representations, limiting their generalizability across model architectures and potentially missing anomalies that deviate from previously observed patterns. We introduce GlitchMiner, a behavior-driven framework that identifies glitch tokens by maximizing predictive entropy. Leveraging a gradient-guided local search strategy, GlitchMiner efficiently explores the discrete token space without relying on model-specific heuristics or large-batch sampling. Extensive experiments across ten LLMs from five major model families demonstrate that GlitchMiner consistently outperforms existing approaches in detection accuracy and query efficiency, providing a generalizable and scalable solution for glitch token discovery. Code is available at https://github.com/wooozihu/GlitchMiner.
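The core idea of entropy-guided search can be illustrated with a minimal sketch. This is not the authors' implementation: it uses a toy linear "model" over a random embedding table, and names like `gradient_guided_step` are hypothetical. The sketch shows the two ingredients the abstract names: predictive entropy as the search objective, and a first-order (gradient-based) Taylor approximation to shortlist candidate tokens in the discrete vocabulary, so the true entropy is evaluated on only a few candidates per step (query efficiency).

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def predictive_entropy(p):
    # H(p) = -sum_k p_k log p_k
    return -np.sum(p * np.log(p + 1e-12))

rng = np.random.default_rng(0)
V, d = 50, 8                  # toy vocabulary size and embedding dim
E = rng.normal(size=(V, d))   # token embedding table (stand-in for the LLM's)
W = rng.normal(size=(d, V))   # toy "model": logits = e @ W

def entropy_of_token(tok):
    return predictive_entropy(softmax(E[tok] @ W))

def entropy_gradient(e):
    # Analytic gradient of H w.r.t. the input embedding for logits = e @ W:
    # dH/dz_k = -p_k (log p_k + H), then chain rule through z = e @ W.
    p = softmax(e @ W)
    logp = np.log(p + 1e-12)
    H = -np.sum(p * logp)
    dHdz = -p * (logp + H)
    return W @ dHdz

def gradient_guided_step(current_tok, k=5):
    # First-order Taylor estimate: H(e_j) ~ H(e_c) + g . (e_j - e_c).
    g = entropy_gradient(E[current_tok])
    scores = (E - E[current_tok]) @ g
    shortlist = np.argsort(-scores)[:k]   # top-k tokens by predicted entropy gain
    # Evaluate the true entropy only on the shortlist (few queries per step).
    best = max(shortlist, key=entropy_of_token)
    return best if entropy_of_token(best) > entropy_of_token(current_tok) else current_tok

tok = 0
for _ in range(10):
    tok = gradient_guided_step(tok)
```

In a real LLM the gradient would come from backpropagation through the model to the input embedding, and tokens with anomalously high predictive entropy are flagged as glitch-token candidates for verification.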