Evaluating rare but high-stakes events is one of the main challenges in obtaining reliable reinforcement learning policies, especially in large or infinite state/action spaces, where limited scalability forces a prohibitively large number of testing iterations. At the same time, a biased or inaccurate policy evaluation in a safety-critical system can cause unexpected catastrophic failures during deployment. This paper proposes the Accelerated Policy Evaluation (APE) method, which simultaneously uncovers rare events and estimates the rare-event probability in Markov decision processes. APE treats the environment (nature) as an adversarial agent and, through adaptive importance sampling, learns the zero-variance sampling distribution for policy evaluation. Moreover, APE scales to large discrete or continuous spaces by incorporating function approximators. We investigate the convergence properties of APE in the tabular setting. Our empirical studies in multi-agent and single-agent environments show that APE estimates the rare-event probability with smaller bias while using orders of magnitude fewer samples than baselines.
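To make the core idea behind the abstract concrete, the following is a minimal sketch of importance-sampling-based rare-event probability estimation, which the APE method builds on; it is not the authors' algorithm. It assumes a simple Gaussian rare-event model with an illustrative threshold `c` and a shifted Gaussian proposal, and shows how reweighting samples from a proposal concentrated on the rare region yields a far lower-variance estimate than naive Monte Carlo.

```python
# Minimal sketch (not the APE algorithm): importance sampling for
# rare-event probability estimation. We estimate p = P(X > c) for X ~ N(0, 1),
# a stand-in for a rare failure event, by sampling from a shifted proposal
# N(c, 1) and reweighting with the likelihood ratio.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
c = 4.0        # illustrative rare-event threshold; true p = 1 - Phi(c) ~ 3.2e-5
n = 10_000     # number of samples for both estimators

# Naive Monte Carlo: almost no samples reach the rare region, so the
# estimate is often exactly zero and has huge relative variance.
x_mc = rng.standard_normal(n)
p_mc = np.mean(x_mc > c)

# Importance sampling: proposal centered on the rare region, with each
# sample reweighted by the likelihood ratio target_pdf / proposal_pdf.
x_is = rng.normal(loc=c, scale=1.0, size=n)
weights = norm.pdf(x_is) / norm.pdf(x_is, loc=c, scale=1.0)
p_is = np.mean((x_is > c) * weights)

print(f"true     p = {1 - norm.cdf(c):.3e}")
print(f"naive MC p = {p_mc:.3e}")
print(f"IS       p = {p_is:.3e}")
```

In APE the proposal is not fixed in advance; the sampling distribution over the environment's transitions is adapted toward the (unknown) zero-variance distribution during learning, which is what the adversarial treatment of the environment and the adaptive importance sampling in the abstract refer to.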