Reward specification plays a central role in reinforcement learning (RL), guiding the agent's behavior. To express non-Markovian rewards, formalisms such as reward machines have been introduced to capture dependencies on the agent's history. However, traditional reward machines cannot model precise timing constraints, limiting their use in time-sensitive applications. In this paper, we propose timed reward machines (TRMs), an extension of reward machines that incorporates timing constraints into the reward structure. TRMs enable more expressive specifications with tunable reward logic, for example, imposing costs for delays and granting rewards for timely actions. We study model-free RL frameworks (specifically, tabular Q-learning) for learning optimal policies with TRMs under digital and real-time semantics. Our algorithms integrate the TRM into learning via abstractions of timed automata and employ counterfactual-imagining heuristics that exploit the structure of the TRM to improve the search. Experimentally, we demonstrate that our algorithm learns policies that achieve high rewards while satisfying the timing constraints specified by the TRM on popular RL benchmarks. Moreover, we conduct comparative studies of performance under different TRM semantics, along with ablations that highlight the benefits of counterfactual-imagining.
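To make the algorithmic idea concrete, the following is a minimal sketch of tabular Q-learning over the product of environment state, TRM mode, and an abstracted integer clock (digital-clock semantics), with counterfactual-imagining updates. The `env` and `trm` interfaces (`step`, `mode_clock_pairs`, `initial_mode`, `actions`) are hypothetical placeholders, not the paper's actual API.

```python
import random
from collections import defaultdict

def q_learning_with_trm(env, trm, episodes=1000, alpha=0.1, gamma=0.95, eps=0.1):
    # Q-values keyed by ((env_state, trm_mode, clock), action).
    Q = defaultdict(float)

    for _ in range(episodes):
        s = env.reset()
        u, c = trm.initial_mode, 0  # TRM mode and integer clock valuation
        done = False
        while not done:
            # Epsilon-greedy action selection over the product state.
            if random.random() < eps:
                a = random.choice(env.actions)
            else:
                a = max(env.actions, key=lambda a_: Q[((s, u, c), a_)])
            s2, label, done = env.step(a)  # label: atomic propositions observed

            # Real TRM transition: next mode, updated clock, and reward.
            u2, c2, r = trm.step(u, c, label)
            best_next = max(Q[((s2, u2, c2), a2)] for a2 in env.actions)
            target = r + (0.0 if done else gamma * best_next)
            Q[((s, u, c), a)] += alpha * (target - Q[((s, u, c), a)])

            # Counterfactual imagining: reuse the same environment transition
            # to update Q for every other TRM mode/clock pair the agent
            # could have been in, exploiting the known TRM structure.
            for (u_cf, c_cf) in trm.mode_clock_pairs():
                if (u_cf, c_cf) == (u, c):
                    continue
                u2_cf, c2_cf, r_cf = trm.step(u_cf, c_cf, label)
                best_cf = max(Q[((s2, u2_cf, c2_cf), a2)] for a2 in env.actions)
                target_cf = r_cf + (0.0 if done else gamma * best_cf)
                Q[((s, u_cf, c_cf), a)] += alpha * (target_cf - Q[((s, u_cf, c_cf), a)])

            s, u, c = s2, u2, c2
    return Q
```

The counterfactual loop mirrors the counterfactual-experience idea used with ordinary reward machines, extended here to clock-abstracted TRM configurations; under real-time semantics the integer clock would be replaced by a region or zone abstraction.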