Specifying informative and dense reward functions remains a pivotal challenge in Reinforcement Learning, as it directly affects the efficiency of agent training. In this work, we harness the expressive power of quantitative Linear Temporal Logic on finite traces ($\text{LTL}_f[\mathcal{F}]$) to synthesize reward monitors that generate a dense stream of rewards for runtime-observable state trajectories. By providing nuanced feedback during training, these monitors guide agents toward optimal behaviour and help mitigate the well-known sparse-reward problem in long-horizon decision making that arises under the Boolean semantics dominating the current literature. Our framework is algorithm-agnostic, relies only on a state labelling function, and naturally accommodates the specification of non-Markovian properties. Empirical results show that our quantitative monitors perform at least as well as Boolean monitors and, depending on the environment, outperform them in maximizing a quantitative measure of task completion and in reducing convergence time.
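As a minimal illustration of why a quantitative semantics yields denser rewards (a sketch under assumed semantics, not necessarily those adopted in this work): for a reach task $\mathsf{F}\, g$ evaluated on the observed prefix $\pi_{0..t}$, a Boolean monitor emits a reward only at the first step where the formula becomes satisfied, whereas a max-style quantitative valuation in the spirit of robustness semantics can reward every step that improves the degree of satisfaction:
\[
r_t^{\mathrm{bool}} \;=\;
\begin{cases}
1 & \text{if } \pi_{0..t} \models \mathsf{F}\, g \text{ and } \pi_{0..t-1} \not\models \mathsf{F}\, g,\\[2pt]
0 & \text{otherwise,}
\end{cases}
\qquad
r_t^{\mathrm{quant}} \;=\; V(\mathsf{F}\, g, \pi_{0..t}) - V(\mathsf{F}\, g, \pi_{0..t-1}),
\]
with, for instance, $V(\mathsf{F}\, g, \pi_{0..t}) = \max_{0 \le i \le t} v_g(\pi_i)$ for some graded valuation $v_g$ of the atomic task $g$ derived from the state labelling function. Under the Boolean scheme the agent receives no signal until $g$ first holds, while the quantitative scheme rewards intermediate progress, which is precisely the densification exploited by the reward monitors.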