Although deep reinforcement learning (DRL) has many success stories, the large-scale deployment of policies learned through these advanced techniques in safety-critical scenarios is hindered by their lack of formal guarantees. Variational Markov Decision Processes (VAE-MDPs) are discrete latent space models that provide a reliable framework for distilling formally verifiable controllers from any RL policy. While the related guarantees address relevant practical aspects such as the satisfaction of performance and safety properties, the VAE approach suffers from several learning flaws (posterior collapse, slow learning speed, poor dynamics estimates), primarily due to the absence of abstraction and representation guarantees to support latent optimization. We introduce the Wasserstein auto-encoded MDP (WAE-MDP), a latent space model that fixes those issues by minimizing a penalized form of the optimal transport between the behaviors of the agent executing the original policy and the distilled policy, for which the formal guarantees apply. Our approach yields bisimulation guarantees while learning the distilled policy, allowing concrete optimization of the abstraction and representation model quality. Our experiments show that, besides distilling policies up to 10 times faster, the latent model quality is indeed better in general. Moreover, we present experiments with a simple time-to-failure verification algorithm on the latent space. The fact that our approach enables such simple verification techniques highlights its applicability.
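For intuition, the "penalized form of the optimal transport" follows the Wasserstein auto-encoder relaxation of Tolstikhin et al. (2018); the display below is a hedged sketch of that generic relaxation adapted to our setting, not the exact WAE-MDP loss, and the symbols $c$, $G_\theta$, $Q_\phi$, $\mathcal{P}_Z$, $D_Z$, and $\lambda$ are illustrative placeholders:
\[
  \min_{\theta,\,\phi}\;
  \mathbb{E}_{x \sim \mathcal{P}_X}\,
  \mathbb{E}_{z \sim Q_\phi(z \mid x)}
  \big[\, c\big(x,\, G_\theta(z)\big) \,\big]
  \;+\;
  \lambda \cdot D_Z\big(Q_\phi(Z),\, \mathcal{P}_Z\big),
\]
where, informally, $\mathcal{P}_X$ stands for the distribution over transitions generated by the agent executing the original policy, $G_\theta$ reconstructs transitions from the discrete latent model, $c$ is a transition-wise reconstruction cost, and the penalty term $D_Z$ drives the aggregated encoding $Q_\phi(Z)$ toward the behavior distribution $\mathcal{P}_Z$ of the latent model under the distilled policy.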