The evaluation of rare but high-stakes events remains one of the main difficulties in obtaining reliable policies from intelligent agents, especially in large or continuous state/action spaces, where limited scalability forces a prohibitively large number of testing iterations. At the same time, a biased or inaccurate policy evaluation in a safety-critical system can cause unexpected catastrophic failures during deployment. In this paper, we propose the Accelerated Policy Evaluation (APE) method, which simultaneously uncovers rare events and estimates the rare-event probability in Markov decision processes. APE treats the environment's nature as an adversarial agent and, through adaptive importance sampling, learns toward the zero-variance sampling distribution for policy evaluation. Moreover, APE scales to large discrete or continuous spaces by incorporating function approximators. We investigate the convergence properties of the proposed algorithms under suitable regularity conditions. Our empirical studies show that APE estimates the rare-event probability with smaller variance while using orders of magnitude fewer samples than baseline methods, in both multi-agent and single-agent environments.
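The variance-reduction idea at the heart of the abstract (sample from a distribution that concentrates on the rare event, then reweight by likelihood ratios so the estimate stays unbiased) can be illustrated outside the MDP setting. The following is a minimal sketch, not the APE algorithm itself: it estimates a rare tail probability of a standard normal with naive Monte Carlo versus importance sampling. The shifted proposal N(4, 1) and the threshold of 4 are arbitrary choices for this toy example.

```python
import numpy as np

rng = np.random.default_rng(0)
threshold = 4.0          # rare event: Z > 4 under a standard normal, prob ~ 3.17e-5
n = 100_000

# Naive Monte Carlo: almost no sample hits the rare region, so the
# estimate is usually zero and its relative variance is huge.
z = rng.standard_normal(n)
naive_est = np.mean(z > threshold)

# Importance sampling: draw from a proposal q = N(threshold, 1) centered
# on the rare region, then reweight each sample by the likelihood ratio
# p(z)/q(z) so the estimator remains unbiased for the original probability.
z_q = rng.normal(loc=threshold, scale=1.0, size=n)
log_ratio = (-0.5 * z_q**2) - (-0.5 * (z_q - threshold)**2)  # log p(z) - log q(z)
is_samples = np.exp(log_ratio) * (z_q > threshold)
is_est = is_samples.mean()
is_stderr = is_samples.std(ddof=1) / np.sqrt(n)

print(f"naive MC estimate          : {naive_est:.2e}")
print(f"importance sampling estimate: {is_est:.2e}  (std err {is_stderr:.1e})")
```

APE pursues the same goal adaptively inside an MDP, updating the sampling distribution toward the (unknown) zero-variance distribution rather than fixing a hand-picked proposal as in this sketch.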