Over the last few years, the biggest breakthroughs in deep reinforcement learning (RL) have come in the discrete action domain. Robotic manipulation, however, is inherently a continuous control problem, and continuous control RL algorithms often depend on actor-critic methods that are sample-inefficient and difficult to train because the actor and critic must be optimised jointly. To that end, we explore how to bring the stability of discrete action RL algorithms to the robot manipulation domain. We extend the recently released ARM algorithm by replacing its continuous next-best pose agent with a discrete one. Discretising rotation is trivial given its bounded nature, whereas translation is inherently unbounded, which makes discretisation difficult. We formulate translation prediction as a voxel prediction problem by discretising the 3D space; however, voxelising a large workspace is memory intensive and does not scale to the high voxel density needed to reach the resolution that robotic manipulation requires. We therefore propose to apply this voxel prediction in a coarse-to-fine manner, gradually increasing the resolution. At each step, we extract the highest-valued voxel as the predicted location, which then serves as the centre of the higher-resolution voxelisation in the next step. Applying this coarse-to-fine prediction over several steps gives a near-lossless prediction of the translation. We show that our coarse-to-fine algorithm accomplishes RLBench tasks much more efficiently than the continuous control equivalent, and can even train some real-world tasks, tabula rasa, in less than 7 minutes with only 3 demonstrations. Moreover, we show that moving to a voxel representation makes it easy to incorporate observations from multiple cameras.
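The coarse-to-fine translation prediction can be illustrated with a minimal sketch. This is not the paper's implementation: the learned Q-function is replaced by a hypothetical stand-in `q_fn` that scores voxel centres, the names `voxel_centres` and `coarse_to_fine` are invented for illustration, and the choice to make each finer grid cover exactly the voxel selected at the previous level is an assumption. The grid size of 16 per axis and the depth of 3 are likewise arbitrary.

```python
import numpy as np


def voxel_centres(centre, side, n):
    """Centres of an n x n x n voxel grid covering a cube of edge `side` around `centre`."""
    voxel = side / n
    # Per-axis offsets of the voxel centres relative to the cube centre.
    offsets = (np.arange(n) + 0.5) * voxel - side / 2.0
    gx, gy, gz = np.meshgrid(offsets, offsets, offsets, indexing="ij")
    return np.asarray(centre, dtype=float) + np.stack([gx, gy, gz], axis=-1)  # (n, n, n, 3)


def coarse_to_fine(q_fn, centre, side, n=16, depth=3):
    """Repeatedly pick the highest-valued voxel and re-voxelise around it.

    q_fn is a hypothetical stand-in for a learned Q-function: it maps an
    (n, n, n, 3) array of voxel centres to an (n, n, n) array of values.
    Returns the predicted translation and the final voxel edge length.
    """
    centre = np.asarray(centre, dtype=float)
    for _ in range(depth):
        grid = voxel_centres(centre, side, n)
        q = q_fn(grid)
        idx = np.unravel_index(np.argmax(q), q.shape)  # highest-valued voxel
        centre = grid[idx]   # becomes the centre of the next, finer grid
        side = side / n      # assumption: next grid spans only the chosen voxel
    return centre, side


# Toy usage: the stand-in Q-function prefers voxels close to a known target.
target = np.array([0.123, -0.321, 0.27])
toy_q = lambda pts: -np.linalg.norm(pts - target, axis=-1)
pred, resolution = coarse_to_fine(toy_q, centre=[0.0, 0.0, 0.0], side=1.0)
print(pred, resolution)
```

Under these assumptions, three levels of 16³ voxels evaluate 3 × 4,096 = 12,288 voxels in total, while a single grid at the same final resolution over the whole workspace would need 4,096³ voxels, which is the memory problem the coarse-to-fine scheme is meant to avoid.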