Reinforcement learning (RL) has achieved some impressive recent successes in various computer games and simulations. Most of these successes are based on having large numbers of episodes from which the agent can learn. In typical robotic applications, however, the number of feasible attempts is very limited. In this paper we present a sample-efficient RL algorithm applied to the example of a table tennis robot. In table tennis every stroke is different, with varying placement, speed and spin. An accurate return therefore has to be determined based on a high-dimensional continuous state space. To make learning in few trials possible, the method is embedded into our robot system. In this way we can use a one-step environment. The state is given by the ball at hitting time (position, velocity, spin) and the action is the racket state (orientation, velocity) at hitting. An actor-critic based deterministic policy gradient algorithm was developed for accelerated learning. Our approach performs competitively both in a simulation and on the real robot in a number of challenging scenarios. Accurate results are obtained without pre-training in under $200$ episodes of training. The video presenting our experiments is available at https://youtu.be/uRAtdoL6Wpw.
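To make the one-step setup concrete, the sketch below shows how such a bandit-like episode could be trained with an actor-critic deterministic policy gradient (in the spirit of DDPG). It is a minimal illustration under assumptions that are not taken from the paper: the state and action dimensions, network sizes, learning rates and reward are all placeholders.

```python
# Minimal sketch (assumption, not the authors' code): one-step episodes where the state
# is the ball at hitting time and the action is the racket state at hitting, trained
# with an actor-critic deterministic policy gradient update.
import torch
import torch.nn as nn

STATE_DIM = 9    # assumed: ball position (3) + velocity (3) + spin (3) at hitting time
ACTION_DIM = 6   # assumed: racket orientation (3) + velocity (3) at hitting

actor = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                      nn.Linear(64, ACTION_DIM), nn.Tanh())
critic = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM, 64), nn.ReLU(),
                       nn.Linear(64, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def update(states, actions, rewards):
    """One-step episodes: the return is just the immediate reward,
    e.g. the negative distance of the returned ball to a target point (assumed)."""
    # Critic regression: Q(s, a) should predict the observed reward.
    q = critic(torch.cat([states, actions], dim=-1)).squeeze(-1)
    critic_loss = nn.functional.mse_loss(q, rewards)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Deterministic policy gradient: ascend Q(s, mu(s)) w.r.t. the actor parameters.
    actor_loss = -critic(torch.cat([states, actor(states)], dim=-1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

# Toy usage with random data standing in for logged strokes.
s = torch.randn(32, STATE_DIM)
a = actor(s).detach() + 0.1 * torch.randn(32, ACTION_DIM)  # exploration noise
r = -torch.rand(32)                                        # placeholder reward
update(s, a, r)
```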