Memory systems are key components that enable AI systems such as LLMs and AI agents to achieve long-term learning and sustained interaction. However, during memory storage and retrieval, these systems frequently exhibit memory hallucinations, including fabrication, errors, conflicts, and omissions. Existing evaluations of memory hallucinations are primarily end-to-end question answering, which makes it difficult to localize the operational stage within the memory system where hallucinations arise. To address this, we introduce the Hallucination in Memory Benchmark (HaluMem), the first operation-level hallucination evaluation benchmark tailored to memory systems. HaluMem defines three evaluation tasks (memory extraction, memory updating, and memory question answering) to comprehensively reveal hallucination behaviors across the different operational stages of interaction. To support evaluation, we construct user-centric, multi-turn human-AI interaction datasets, HaluMem-Medium and HaluMem-Long. Both include about 15k memory points and 3.5k multi-type questions. The average dialogue length per user reaches 1.5k and 2.6k turns, respectively, with context lengths exceeding 1M tokens, enabling evaluation of hallucinations across different context scales and task complexities. Empirical studies based on HaluMem show that existing memory systems tend to generate and accumulate hallucinations during the extraction and updating stages, which subsequently propagate errors to the question answering stage. Future research should focus on developing interpretable and constrained memory operation mechanisms that systematically suppress hallucinations and improve memory reliability.
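To make the operation-level idea concrete, the sketch below scores one stage (memory extraction) by comparing a system's extracted memory points against gold annotations, counting fabrications (extracted points with no gold counterpart) and omissions (gold points the system missed). All names (`MemoryPoint`, `evaluate_extraction`, the example fields) are illustrative assumptions for exposition, not the HaluMem implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryPoint:
    """A single atomic memory about the user, e.g. ('job', 'teacher').

    Hypothetical structure; HaluMem's actual memory-point schema may differ.
    """
    key: str
    value: str


def evaluate_extraction(gold: list[MemoryPoint],
                        extracted: list[MemoryPoint]) -> dict:
    """Score the extraction stage against gold memory points.

    fabricated: extracted points absent from the gold set (hallucinated)
    omitted:    gold points the system failed to extract
    correct:    points present in both sets
    """
    gold_set = set(gold)
    ext_set = set(extracted)
    return {
        "fabricated": len(ext_set - gold_set),
        "omitted": len(gold_set - ext_set),
        "correct": len(gold_set & ext_set),
    }


# Toy usage: one correct point, one omission, one fabrication.
gold = [MemoryPoint("job", "teacher"), MemoryPoint("city", "Paris")]
extracted = [MemoryPoint("job", "teacher"), MemoryPoint("pet", "cat")]
print(evaluate_extraction(gold, extracted))
# → {'fabricated': 1, 'omitted': 1, 'correct': 1}
```

The same set-comparison pattern extends naturally to the updating stage (did the system overwrite, retain, or conflict on a changed fact?), which is what lets errors be attributed to a specific operation rather than only surfacing as wrong answers at the QA stage.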