Real-world sequential decision making requires data-driven algorithms that provide practical guarantees on performance throughout training while also making efficient use of data. Model-free deep reinforcement learning provides a framework for such data-driven decision making, but existing algorithms typically focus on only one of these goals while sacrificing performance with respect to the other. On-policy algorithms guarantee policy improvement throughout training but suffer from high sample complexity, while off-policy algorithms make efficient use of data through sample reuse but lack theoretical guarantees. To balance these competing goals, we develop a class of Generalized Policy Improvement algorithms that combines the policy improvement guarantees of on-policy methods with the efficiency of theoretically supported sample reuse. We demonstrate the benefits of this new class of algorithms through extensive experimental analysis on a variety of continuous control tasks from the DeepMind Control Suite.