Safe reinforcement learning (SafeRL) is a prominent paradigm for autonomous driving, where agents must optimize performance under strict safety constraints. This dual objective creates a fundamental tension: overly conservative policies limit driving efficiency, while aggressive exploration risks safety violations. The Safety Representations for Safer Policy Learning (SRPL) framework addresses this challenge by equipping agents with a predictive model of future constraint violations and has shown promise in controlled environments. This paper investigates whether SRPL extends to real-world autonomous driving scenarios. Systematic experiments on the Waymo Open Motion Dataset (WOMD) and nuPlan demonstrate that SRPL can improve the reward-safety tradeoff, achieving statistically significant gains in success rate (effect sizes r = 0.65-0.86) and cost reduction (effect sizes r = 0.70-0.83), with p < 0.05 for the observed improvements. However, its effectiveness depends on the underlying policy optimizer and the dataset distribution. The results further show that predictive safety representations play a critical role in improving robustness to observation noise. Additionally, in zero-shot cross-dataset evaluation, SRPL-augmented agents generalize better than non-SRPL baselines. These findings collectively demonstrate the potential of predictive safety representations to strengthen SafeRL for autonomous driving.