Online 3D Bin Packing with Constrained Deep Reinforcement Learning

We solve a challenging yet practically useful variant of the 3D Bin Packing Problem (3D-BPP). In our problem, the agent has limited information about the items to be packed into the bin, and an item must be packed immediately after its arrival, without buffering or readjusting. The item's placement is also subject to the constraints of collision avoidance and physical stability. We formulate this online 3D-BPP as a constrained Markov decision process. To solve the problem, we propose an effective and easy-to-implement constrained deep reinforcement learning (DRL) method under the actor-critic framework. In particular, we introduce a feasibility predictor that predicts the feasibility mask for the placement actions, and we use this mask to modulate the action probabilities output by the actor during training. Such supervision and transformation of DRL help the agent learn feasible policies efficiently. Our method can also be generalized, e.g., to handle lookahead or items with different orientations. We have conducted an extensive evaluation showing that the learned policy significantly outperforms state-of-the-art methods. A user study suggests that our method attains human-level performance.
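To make the idea of a feasibility mask concrete, the following is a minimal sketch and not the method from the paper. It assumes the bin is represented as a height map, uses a deliberately simplified rule (the item must fit under the bin height and rest on a perfectly flat footprint) in place of the learned feasibility predictor, and uses a masked softmax as one common way to modulate action probabilities. The names `feasibility_mask` and `masked_policy` are hypothetical.

```python
import numpy as np


def feasibility_mask(heightmap, item, bin_height):
    """Rule-based stand-in for a feasibility mask (assumption, not the paper's predictor).

    heightmap: (L, W) array of current stack heights in the bin.
    item: (l, w, h) dimensions of the incoming item.
    Returns an (L, W) array with 1.0 where the item's corner may be placed.
    """
    L, W = heightmap.shape
    l, w, h = item
    mask = np.zeros((L, W), dtype=np.float64)
    for x in range(L - l + 1):          # item must stay inside the bin
        for y in range(W - w + 1):
            footprint = heightmap[x:x + l, y:y + w]
            base = footprint.max()
            fits = base + h <= bin_height
            # Simplified stability rule: the whole footprint is level.
            supported = bool(np.all(footprint == base))
            if fits and supported:
                mask[x, y] = 1.0
    return mask


def masked_policy(logits, mask):
    """Masked softmax: infeasible placements get probability zero."""
    feasible = mask.ravel().astype(bool)
    if not feasible.any():
        raise ValueError("no feasible placement for this item")
    masked_logits = np.where(feasible, logits, -np.inf)
    shifted = masked_logits - masked_logits[feasible].max()  # numerical stability
    weights = np.exp(shifted)           # exp(-inf) evaluates to 0
    return weights / weights.sum()


if __name__ == "__main__":
    heightmap = np.zeros((4, 4))
    heightmap[0, 0] = 1.0               # one occupied cell
    mask = feasibility_mask(heightmap, item=(2, 2, 1), bin_height=3)
    probs = masked_policy(np.zeros(mask.size), mask)
    print(mask)
    print(probs.reshape(mask.shape))
```

In this sketch the mask is computed exactly from the height map. In the setting described above, a predictor would instead be trained to output such a mask, and a rule-based mask like this one could serve as its supervision signal.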
