Deploying existing risk-averse approaches in real-world applications remains challenging. The reasons are multi-fold, including the lack of a global optimality guarantee and the need to learn from long consecutive trajectories. Long consecutive trajectories are prone to visiting hazardous states, a major concern in the risk-averse setting. This paper proposes Short-Term VOlatility-controlled Policy Search (STOPS), a novel algorithm that solves risk-averse problems by learning from short-term trajectories instead of long-term ones. Short-term trajectories are more flexible to generate and avoid the danger of hazardous state visitations. Using an actor-critic scheme with an overparameterized two-layer neural network, our algorithm finds a globally optimal policy at a sublinear rate under both proximal policy optimization and natural policy gradient updates, matching the state-of-the-art convergence rate of risk-neutral policy-search methods. The algorithm is evaluated on challenging MuJoCo robot simulation tasks under the mean-variance evaluation metric. Both theoretical analysis and experimental results demonstrate that STOPS achieves state-of-the-art performance among existing risk-averse policy search methods.
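For reference, a minimal sketch of a standard mean-variance objective for risk-averse policy search is given below; this is a common formulation from the literature and may differ in detail from the exact objective optimized by STOPS. Here $r_t$ denotes the per-step reward and $\lambda$ is an assumed risk-aversion coefficient:

$$
\max_{\pi}\; J_{\lambda}(\pi) \;=\; \mathbb{E}_{\pi}\!\Big[\sum_{t=0}^{T-1} r_t\Big] \;-\; \lambda\, \mathrm{Var}_{\pi}\!\Big[\sum_{t=0}^{T-1} r_t\Big], \qquad \lambda \ge 0,
$$

where larger $\lambda$ penalizes return volatility more heavily, trading expected return for lower risk.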