Optimally setting the several hyper-parameters of a machine learning algorithm is key to making the most of the available data. To this aim, several methods have been proposed, such as evolutionary strategies, random search, Bayesian optimization, and heuristic rules of thumb. In reinforcement learning (RL), the information content of the data gathered by the learning agent while interacting with its environment depends heavily on the setting of many hyper-parameters. Therefore, the user of an RL algorithm has to rely on search-based optimization methods, such as grid search or the Nelder-Mead simplex algorithm, which are very inefficient for most RL tasks, significantly slow down the learning curve, and leave to the user the burden of purposefully biasing data gathering. In this work, in order to make an RL algorithm more user-independent, a novel approach for autonomous hyper-parameter setting using Bayesian optimization is proposed. Data from past episodes and different hyper-parameter values are used at a meta-learning level through behavioral cloning, which helps improve the effectiveness of maximizing a reinforcement-learning variant of an acquisition function. In addition, by tightly integrating Bayesian optimization into the design of the reinforcement learning agent, the number of state transitions needed to converge to the optimal policy for a given task is reduced. Computational experiments show promising results compared to manual tweaking and other optimization-based approaches, highlighting the benefits of adapting the algorithm's hyper-parameters to increase the information content of the generated data.
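To make the general idea concrete, the following is a minimal sketch (not the paper's algorithm, which additionally uses behavioral cloning at the meta-learning level) of Bayesian optimization applied to a single RL hyper-parameter: a Gaussian-process surrogate with an expected-improvement acquisition function searches over a learning rate. The function `run_episode_return` is a hypothetical stand-in for training the agent and reporting its average return.

```python
# Minimal illustrative sketch: Bayesian optimization of one RL hyper-parameter
# (a learning rate) using a Gaussian-process surrogate and expected improvement.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern


def run_episode_return(learning_rate: float) -> float:
    """Hypothetical objective: average return of an agent trained with this rate."""
    # Placeholder response surface; replace with a real training/evaluation run.
    return -(np.log10(learning_rate) + 3.0) ** 2 + np.random.normal(scale=0.1)


def expected_improvement(candidates, gp, best_y, xi=0.01):
    """Expected improvement over the best observed return (maximization)."""
    mu, sigma = gp.predict(candidates, return_std=True)
    sigma = np.maximum(sigma, 1e-9)
    imp = mu - best_y - xi
    z = imp / sigma
    return imp * norm.cdf(z) + sigma * norm.pdf(z)


rng = np.random.default_rng(0)
log_lr_bounds = (-5.0, -1.0)                      # search over log10(learning rate)
X = rng.uniform(*log_lr_bounds, size=(3, 1))      # a few random initial evaluations
y = np.array([run_episode_return(10.0 ** x[0]) for x in X])

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
for _ in range(10):                               # sequential BO iterations
    gp.fit(X, y)
    cand = np.linspace(*log_lr_bounds, 200).reshape(-1, 1)
    ei = expected_improvement(cand, gp, y.max())
    x_next = cand[np.argmax(ei)]                  # query point with highest EI
    y_next = run_episode_return(10.0 ** x_next[0])
    X, y = np.vstack([X, x_next]), np.append(y, y_next)

print("best log10(lr):", X[np.argmax(y)][0], "return:", y.max())
```

In the approach described above, such a loop is embedded in the agent itself, so the hyper-parameters are adjusted while data is being gathered rather than through repeated full training runs.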