Obstacle Avoidance for Robotic Manipulator in Joint Space via Improved Proximal Policy Optimization

Reaching tasks with random targets and obstacles remain challenging when a robotic arm operates in unstructured environments. In contrast to traditional model-based methods, model-free reinforcement learning methods do not require computing complex inverse kinematics or dynamics equations. In this paper, we train a deep neural network with an improved Proximal Policy Optimization (PPO) algorithm to map from task space to joint space for a 6-DoF manipulator. In particular, we modify the original PPO and design an effective representation of the environment's inputs and outputs so that the robot trains faster in a larger workspace. First, a type of action ensemble is adopted to improve output efficiency. Second, the policy is designed to take part directly in value function updates. Finally, the distance between obstacles and the links of the manipulator is calculated with a geometric method and included in the state representation. Since training such a task on a real robot is time-consuming and strenuous, we develop a simulation environment to train the model. We choose Gazebo as our first simulation environment since it often produces a smaller Sim-to-Real gap than other simulators. However, training in Gazebo is slow. To address this limitation, we propose a Sim-to-Sim method that significantly reduces the training time. The trained model is finally deployed on a real-robot setup without fine-tuning. Experimental results show that, with our method, the robot can track a single target or reach multiple targets in unstructured environments.
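The abstract does not spell out the geometric method used for the obstacle-to-link distance. As a rough illustration only, the sketch below assumes each link is a straight line segment between consecutive joint positions and each obstacle is a sphere. The function names `point_to_segment_distance` and `link_obstacle_clearances` and the sphere model are hypothetical and not taken from the paper.

```python
import numpy as np


def point_to_segment_distance(p, a, b):
    """Shortest Euclidean distance from point p to the segment a-b."""
    p, a, b = (np.asarray(v, dtype=float) for v in (p, a, b))
    ab = b - a
    denom = float(ab @ ab)
    if denom == 0.0:
        # Degenerate link: both endpoints coincide.
        return float(np.linalg.norm(p - a))
    # Project p onto the line through a and b, then clamp to the segment.
    t = float(np.clip((p - a) @ ab / denom, 0.0, 1.0))
    closest = a + t * ab
    return float(np.linalg.norm(p - closest))


def link_obstacle_clearances(joint_positions, obstacle_center, obstacle_radius):
    """Clearance per link (assumed model: segment links, spherical obstacle).

    A negative value means the link centerline intersects the sphere.
    """
    pts = np.asarray(joint_positions, dtype=float)
    return [
        point_to_segment_distance(obstacle_center, pts[i], pts[i + 1]) - obstacle_radius
        for i in range(len(pts) - 1)
    ]


if __name__ == "__main__":
    # Hypothetical joint positions for a two-link chain, in meters.
    joints = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
    print(link_obstacle_clearances(joints, (0.5, 0.5, 0.0), 0.1))
```

Per-link clearances like these could be appended to the state vector alongside joint angles and the target position. Whether the paper uses this exact representation is not stated in the abstract.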
