Safe reinforcement learning aims to learn a control policy while ensuring that neither the system nor the environment is damaged during the learning process. To implement safe reinforcement learning on highly nonlinear, high-dimensional dynamical systems, one possible approach is to find a low-dimensional safe region via data-driven feature extraction, which provides safety estimates to the learning algorithm. As the reliability of the learned safety estimates is data-dependent, we investigate in this work how different choices of training data affect the safe reinforcement learning approach. Balancing learning performance against the risk of unsafe behavior, we propose a data generation method that combines two sampling strategies to produce representative training data. The performance of the method is demonstrated on a three-link inverted pendulum example.
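To make the data generation idea concrete, the following is a minimal sketch of combining two sampling strategies, one exploring the state space broadly and one concentrating samples near the safe-set boundary where the safety estimate matters most. All names, the toy safety criterion, and the sampling parameters are illustrative assumptions, not the paper's actual method.

```python
import numpy as np

rng = np.random.default_rng(0)

def is_safe(state):
    # Toy safety criterion: the state lies inside a box. In practice this
    # would come from simulating the dynamics and checking constraints.
    return float(np.all(np.abs(state) < 1.0))

def uniform_samples(n, low=-2.0, high=2.0, dim=2):
    # Strategy 1: sample the whole state space uniformly, so the learned
    # safety estimate also sees clearly safe and clearly unsafe regions.
    return rng.uniform(low, high, size=(n, dim))

def boundary_samples(n, dim=2, noise=0.2):
    # Strategy 2: concentrate samples near the assumed safe-set boundary
    # (here |x_i| = 1), where misclassification is most costly.
    signs = rng.choice([-1.0, 1.0], size=(n, dim))
    return signs * 1.0 + rng.normal(0.0, noise, size=(n, dim))

def make_dataset(n_uniform, n_boundary):
    # Combine both strategies into one labeled training set for a
    # data-driven safety estimator.
    X = np.vstack([uniform_samples(n_uniform), boundary_samples(n_boundary)])
    y = np.array([is_safe(x) for x in X])
    return X, y

X, y = make_dataset(200, 200)
```

The ratio of uniform to boundary samples is the knob that trades off broad coverage (learning performance) against resolution near the boundary (risk of misjudging safety), mirroring the balance the abstract describes.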