Model-based reinforcement learning has proven to be more sample efficient than model-free methods. However, constructing the dynamics model in model-based reinforcement learning adds complexity. Data processing tasks in radio astronomy are such situations, where the original problem being solved by reinforcement learning is itself the construction of a model. Fortunately, many methods based on heuristics or signal processing already exist to perform the same tasks, and we can leverage them to propose the best action to take, or in other words, to provide a `hint'. We propose to use `hints' generated by the environment as an aid to the reinforcement learning process, mitigating the complexity of model construction. We modify the soft actor-critic algorithm to use hints and train the agent with the alternating direction method of multipliers algorithm with inequality constraints. Results in several environments show that using hints yields increased sample efficiency compared to model-free methods.
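To make the idea concrete, the sketch below shows one way a hint-constrained actor update could look: the actor maximizes the critic value while an inequality constraint keeps its actions within a tolerance of the hint actions, handled with a non-negative dual variable updated by dual ascent in the spirit of ADMM / augmented Lagrangian methods. This is a minimal illustration under assumed names and hyperparameters (`epsilon`, `rho`, the quadratic hint penalty, the toy networks), not the paper's exact formulation, and it omits SAC details such as the entropy term and critic training.

```python
import torch
import torch.nn as nn

# Toy dimensions and networks (placeholders, not the paper's architecture).
state_dim, action_dim = 8, 2
policy = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, action_dim))
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1))

opt = torch.optim.Adam(policy.parameters(), lr=3e-4)
lam = torch.tensor(0.0)   # dual variable for the inequality constraint
rho = 10.0                # penalty weight / dual step size (assumed value)
epsilon = 0.05            # allowed mean-squared deviation from the hint (assumed value)

for step in range(200):
    states = torch.randn(64, state_dim)   # placeholder batch of states
    hints = torch.randn(64, action_dim)   # hint actions supplied by the environment
    actions = torch.tanh(policy(states))  # deterministic mean action, for brevity

    q_values = critic(torch.cat([states, actions], dim=-1))

    # Inequality constraint: E[ ||a_pi - a_hint||^2 ] - epsilon <= 0.
    constraint = ((actions - hints) ** 2).sum(-1).mean() - epsilon

    # Augmented-Lagrangian actor objective: maximize Q subject to the hint constraint.
    loss = (-q_values.mean()
            + lam.detach() * constraint
            + 0.5 * rho * constraint.clamp(min=0) ** 2)

    opt.zero_grad()
    loss.backward()
    opt.step()

    # Dual ascent on the multiplier; projection onto [0, inf) enforces the inequality form.
    with torch.no_grad():
        lam = (lam + rho * constraint.detach()).clamp(min=0.0)
```

One design point worth noting: because the multiplier is projected to stay non-negative, the hint term only pushes on the actor while the constraint is violated; once the policy is within the tolerance of the hint, the update reduces to the usual value-maximizing step.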