Temporal reasoning over long, multi-session dialogues is a critical capability for conversational agents. However, prior work and our pilot study show that as dialogue histories grow in length and accumulate noise, current long-context models struggle to accurately identify temporally pertinent information, significantly impairing reasoning performance. To address this, we introduce Memory-T1, a framework that learns a time-aware memory selection policy via reinforcement learning (RL). It employs a coarse-to-fine strategy: temporal and relevance filters first prune the dialogue history into a candidate set, and an RL agent then selects the precise evidence sessions. The RL training is guided by a multi-level reward function that optimizes (i) answer accuracy, (ii) evidence grounding, and (iii) temporal consistency. In particular, the temporal consistency reward provides a dense signal by evaluating alignment with the query's time scope at both the session level (chronological proximity) and the utterance level (chronological fidelity), enabling the agent to resolve subtle chronological ambiguities. On the Time-Dialog benchmark, Memory-T1 lifts a 7B model to an overall score of 67.0\%, establishing a new state of the art among open-source models and outperforming a 14B baseline by 10.2\%. Ablation studies show that the temporal consistency and evidence grounding rewards jointly contribute a 15.0\% performance gain. Moreover, Memory-T1 remains robust up to 128k tokens, where baseline models collapse, demonstrating its effectiveness against noise in extensive dialogue histories. The code and datasets are publicly available at https://github.com/Elvin-Yiming-Du/Memory-T1/
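To make the coarse-to-fine selection concrete, the following is a minimal Python sketch of the coarse pruning stage, under stated assumptions: the `Session` container, the day-based slack window around the query's time scope, and the token-overlap relevance score are illustrative placeholders rather than the paper's actual filters; the fine-grained RL selector (not shown) operates on the resulting candidate set.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List

@dataclass
class Session:
    """One dialogue session with its wall-clock timestamp (assumed schema)."""
    session_id: int
    timestamp: datetime
    utterances: List[str]

def temporal_filter(history: List[Session],
                    query_start: datetime,
                    query_end: datetime,
                    slack_days: int = 30) -> List[Session]:
    """Coarse temporal filter: keep sessions whose timestamp falls inside
    the query's time scope, padded by an assumed slack window."""
    lo = query_start - timedelta(days=slack_days)
    hi = query_end + timedelta(days=slack_days)
    return [s for s in history if lo <= s.timestamp <= hi]

def relevance_filter(candidates: List[Session],
                     query: str,
                     top_k: int = 20) -> List[Session]:
    """Coarse relevance filter: rank by query-token overlap and keep the
    top-k sessions in chronological order. A real system would more likely
    use an embedding retriever; token overlap keeps the sketch self-contained."""
    q_tokens = set(query.lower().split())
    def overlap(s: Session) -> int:
        text = " ".join(s.utterances).lower()
        return sum(1 for t in q_tokens if t in text)
    ranked = sorted(candidates, key=overlap, reverse=True)[:top_k]
    return sorted(ranked, key=lambda s: s.timestamp)

def prune_history(history: List[Session], query: str,
                  query_start: datetime, query_end: datetime) -> List[Session]:
    """Coarse stage of the pipeline; the RL agent then selects the precise
    evidence sessions from this candidate set (fine stage, not shown)."""
    return relevance_filter(
        temporal_filter(history, query_start, query_end), query)
```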
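The abstract names the three reward components but not their functional forms. The sketch below is one plausible instantiation, reusing the `Session` container from the previous snippet; the exact-match answer check, the F1 evidence-grounding score, the linear proximity decay (`tau_days`), the order-based fidelity check, and the 0.4/0.3/0.3 weights are all assumptions for illustration.

```python
from datetime import datetime
from typing import List, Set

def answer_reward(pred: str, gold: str) -> float:
    """(i) Answer accuracy; exact match is an assumed stand-in for the
    paper's accuracy metric."""
    return 1.0 if pred.strip().lower() == gold.strip().lower() else 0.0

def grounding_reward(selected_ids: Set[int], gold_ids: Set[int]) -> float:
    """(ii) Evidence grounding as F1 between selected and annotated
    evidence session IDs (assumed form)."""
    if not selected_ids or not gold_ids:
        return 0.0
    tp = len(selected_ids & gold_ids)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(selected_ids), tp / len(gold_ids)
    return 2 * precision * recall / (precision + recall)

def temporal_reward(selected: List["Session"],
                    query_start: datetime,
                    query_end: datetime,
                    tau_days: float = 30.0) -> float:
    """(iii) Temporal consistency at two granularities (assumed forms).
    Session level: proximity of each selected session to the query's time
    scope, decaying linearly with the gap in days up to tau_days.
    Utterance level: fidelity, approximated here as keeping the selected
    sessions in chronological order."""
    if not selected:
        return 0.0
    prox = []
    for s in selected:
        if query_start <= s.timestamp <= query_end:
            prox.append(1.0)
        else:
            gap = min(abs((s.timestamp - query_start).days),
                      abs((s.timestamp - query_end).days))
            prox.append(max(0.0, 1.0 - gap / tau_days))
    session_score = sum(prox) / len(prox)
    ordered = all(a.timestamp <= b.timestamp
                  for a, b in zip(selected, selected[1:]))
    fidelity_score = 1.0 if ordered else 0.0
    return 0.5 * (session_score + fidelity_score)

def multi_level_reward(pred: str, gold: str,
                       selected: List["Session"], gold_ids: Set[int],
                       query_start: datetime, query_end: datetime,
                       w_ans: float = 0.4, w_ground: float = 0.3,
                       w_time: float = 0.3) -> float:
    """Weighted sum of the three components; the weights are assumptions."""
    selected_ids = {s.session_id for s in selected}
    return (w_ans * answer_reward(pred, gold)
            + w_ground * grounding_reward(selected_ids, gold_ids)
            + w_time * temporal_reward(selected, query_start, query_end))
```

In this reading, the session-level term rewards selecting sessions near the queried time window while the utterance-level term rewards preserving chronological order, which illustrates how a dense temporal signal could be available even on rollouts where the final answer is wrong.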