Deep reinforcement learning (DRL) has demonstrated its potential for solving complex manufacturing decision-making problems, especially in contexts where the system must learn over time from actual operation in the absence of training data. One interesting and challenging application for such methods is the assembly sequence planning (ASP) problem. In this paper, we propose an approach for implementing DRL methods in ASP. The proposed approach introduces parametric actions into the RL environment to improve training time and sample efficiency, and it uses two different reward signals: (1) the user's preferences and (2) the total assembly duration. The user's preference signal captures the difficulties and non-ergonomic aspects of the assembly faced by the human operator, while the total assembly time signal drives the optimization of the assembly. Three of the most powerful deep RL methods, Advantage Actor-Critic (A2C), Deep Q-Learning (DQN), and Rainbow, were studied in two different scenarios: a stochastic one and a deterministic one. Finally, the performance of the DRL algorithms was compared with that of tabular Q-Learning. After 10,000 episodes, the system achieved near-optimal behaviour with tabular Q-Learning, A2C, and Rainbow. However, for more complex scenarios, tabular Q-Learning is expected to underperform compared with the other two algorithms. The results support the potential of deep reinforcement learning for assembly sequence planning problems with human interaction.
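To make the parametric-action and dual-reward ideas above concrete, the following minimal sketch shows a toy ASP environment in which the agent only chooses among feasible parts (via an action mask) and the reward combines a user-preference penalty with the negative step duration. All part names, durations, penalties, and the noise model are hypothetical illustrations, not data or code from the paper.

```python
# Minimal sketch (hypothetical values) of an assembly-sequence environment
# with parametric (masked) actions and a combined reward signal.
import random

class AssemblyEnv:
    """Toy ASP environment: each action assembles one remaining part.

    Parametric actions are exposed through action_mask(), so the agent only
    samples parts whose precedence constraints are satisfied. The reward
    combines (1) a user-preference penalty for non-ergonomic steps and
    (2) the negative duration of the chosen assembly step.
    """

    # part -> parts that must already be assembled (precedence graph)
    PRECEDENCE = {"base": set(), "motor": {"base"}, "gear": {"motor"}, "cover": {"gear"}}
    DURATION = {"base": 5.0, "motor": 8.0, "gear": 3.0, "cover": 2.0}          # seconds (hypothetical)
    PREFERENCE_PENALTY = {"base": 0.0, "motor": 1.0, "gear": 0.5, "cover": 0.0}  # hypothetical

    def __init__(self, stochastic=False):
        self.parts = list(self.PRECEDENCE)
        self.stochastic = stochastic
        self.reset()

    def reset(self):
        self.assembled = set()
        return self._state()

    def _state(self):
        # Binary vector indicating which parts are already assembled.
        return tuple(int(p in self.assembled) for p in self.parts)

    def action_mask(self):
        """1 for parts that are feasible now, 0 otherwise (parametric actions)."""
        return [int(p not in self.assembled and self.PRECEDENCE[p] <= self.assembled)
                for p in self.parts]

    def step(self, action):
        part = self.parts[action]
        assert self.action_mask()[action], f"infeasible action: {part}"
        duration = self.DURATION[part]
        if self.stochastic:                      # stochastic scenario: noisy step times
            duration *= random.uniform(0.8, 1.2)
        self.assembled.add(part)
        reward = -duration - self.PREFERENCE_PENALTY[part]
        done = len(self.assembled) == len(self.parts)
        return self._state(), reward, done


if __name__ == "__main__":
    env = AssemblyEnv(stochastic=True)
    state, done, total = env.reset(), False, 0.0
    while not done:
        feasible = [i for i, m in enumerate(env.action_mask()) if m]
        state, r, done = env.step(random.choice(feasible))  # random feasible policy
        total += r
    print("episode return:", round(total, 2))
```

In this sketch the mask is what makes the actions "parametric": a learning agent (tabular Q-Learning, DQN, A2C, or Rainbow) would restrict its choice to masked-in actions, which shrinks the effective action space and improves sample efficiency relative to letting the agent explore infeasible assembly steps.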