(This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.) To improve the efficiency of deep reinforcement learning (DRL)-based methods for robot manipulator trajectory planning in random working environments, we present three dense reward functions that differ from the traditional sparse reward. First, a posture reward function is proposed to speed up the learning process and yield a more reasonable trajectory by modeling distance and direction constraints, which reduces blind exploration. Second, a stride reward function is proposed to improve the stability of the learning process by constraining both the distance to the target and the per-step movement of the joints. Finally, to further improve learning efficiency, we draw inspiration from the cognitive process of human behavior and propose a stage incentive mechanism comprising a hard stage incentive reward function and a soft stage incentive reward function. Extensive experiments show that the soft stage incentive reward function improves the convergence rate by up to 46.9% with state-of-the-art DRL methods. The mean reward at convergence increased by 4.4-15.5%, and its standard deviation decreased by 21.9-63.2%. In the evaluation experiments, the success rate of trajectory planning for the robot manipulator reached 99.6%.
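The abstract does not give the exact form of the three dense rewards, so the following is only a rough, hypothetical sketch of how a combined dense reward of this kind (a posture term on distance and direction, a stride penalty on per-step joint motion, and a graded soft stage incentive bonus) might be computed in practice. All function names, weights, and thresholds here are assumptions for illustration, not the formulation used in the paper.

```python
import numpy as np

def dense_reward(ee_pos, ee_dir, target_pos, target_dir, joint_step,
                 w_dist=1.0, w_orient=0.5, w_stride=0.1,
                 stage_thresholds=(0.20, 0.10, 0.05), stage_bonus=0.5):
    """Hypothetical dense reward combining posture, stride, and a soft
    stage incentive. Inputs are NumPy arrays; weights and thresholds
    are placeholder values, not the paper's."""
    # Posture term: penalize Euclidean distance and direction mismatch
    # between the end-effector and the target.
    d = np.linalg.norm(ee_pos - target_pos)
    cos_err = 1.0 - np.dot(ee_dir, target_dir) / (
        np.linalg.norm(ee_dir) * np.linalg.norm(target_dir) + 1e-8)
    r_posture = -(w_dist * d + w_orient * cos_err)

    # Stride term: penalize large per-step joint motion to keep the
    # learning process stable and the trajectory smooth.
    r_stride = -w_stride * np.linalg.norm(joint_step)

    # Soft stage incentive: a graded bonus that grows as the distance
    # falls below progressively tighter thresholds, rather than a
    # single sparse reward at the goal.
    r_stage = sum(stage_bonus * (1.0 - d / t)
                  for t in stage_thresholds if d < t)

    return r_posture + r_stride + r_stage
```

A hard stage incentive would, under the same assumptions, replace the graded term with a fixed bonus per threshold crossed; the soft variant above avoids the resulting reward discontinuities.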