Reaching random targets while avoiding obstacles is a challenging task for robotic manipulators. In this study, we propose a novel model-free reinforcement learning approach based on proximal policy optimization (PPO) for training a deep policy that maps the task space to the joint space of a 6-DoF manipulator. To facilitate training in a large workspace, we develop an efficient representation of environmental inputs and outputs. The distance between obstacles and manipulator links is computed with a geometry-based method and incorporated into the state representation. Additionally, to enhance performance on reaching tasks, we introduce an action-ensemble method and design the policy to participate directly in the value-function updates of PPO. To overcome the challenges of training in real-robot environments, we develop a simulation environment in Gazebo to train the model, as it produces a smaller Sim-to-Real gap than other simulators. However, training in Gazebo is time-intensive. To address this issue, we propose a Sim-to-Sim method that significantly reduces the training time. The trained model is then applied directly in a real-robot setup without fine-tuning. To evaluate the proposed approach, we perform several rounds of experiments on both simulated and real robots and compare its performance with six baselines. The experimental results demonstrate the effectiveness of the proposed method in performing reaching tasks with and without obstacles. Our method outperforms the selected baselines by a large margin across different reaching-task scenarios. A video of these experiments is provided as supplementary material.