Graph-based Reinforcement Learning meets Mixed Integer Programs: An application to 3D robot assembly discovery

Robot assembly discovery (RAD) is a challenging problem that lives at the intersection of resource allocation and motion planning. The goal is to combine a predefined set of objects to form something new while considering task execution with the robot-in-the-loop. In this work, we tackle the problem of building arbitrary, predefined target structures entirely from scratch using a set of Tetris-like building blocks and a robotic manipulator. Our novel hierarchical approach aims at efficiently decomposing the overall task into three feasible levels that mutually benefit from each other. On the high level, we run a classical mixed-integer program for global optimization of block-type selection and the blocks' final poses to recreate the desired shape. Its output is then exploited to efficiently guide the exploration of an underlying reinforcement learning (RL) policy. This RL policy draws its generalization properties from a flexible graph-based representation that is learned through Q-learning and can be refined with search. Moreover, it accounts for the necessary conditions of structural stability and robotic feasibility that cannot be effectively reflected in the previous layer. Lastly, a grasp and motion planner transforms the desired assembly commands into robot joint movements. We demonstrate our proposed method's performance on a set of competitive simulated RAD environments, showcase real-world transfer, and report performance and robustness gains compared to an unstructured end-to-end approach. Videos are available at https://sites.google.com/view/rl-meets-milp.

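To make the high-level layer more concrete, the block-selection step can be viewed as a 0-1 program: one binary variable per candidate block placement, a constraint that every cell of the target shape is covered exactly once, and an objective such as minimizing the number of blocks. The sketch below is only an illustration under strong assumptions: it works on a 2D grid instead of 3D, the block set `BLOCKS` and the function names are hypothetical, and a small branch-and-bound search stands in for the mixed-integer solver used in the work. It ignores structural stability and robotic feasibility, which the abstract assigns to the RL layer.

```python
# Hypothetical 2D block types, given as cell offsets (one fixed orientation each).
BLOCKS = {
    "I2": [(0, 0), (0, 1)],
    "L3": [(0, 0), (1, 0), (1, 1)],
    "O1": [(0, 0)],
}


def candidate_placements(target):
    """Enumerate (block name, covered cells) pairs lying fully inside the target."""
    placements = []
    for name, offsets in BLOCKS.items():
        for (r, c) in sorted(target):
            cells = frozenset((r + dr, c + dc) for dr, dc in offsets)
            if cells <= target:
                placements.append((name, cells))
    return placements


def solve_cover(target):
    """Minimize the block count s.t. every target cell is covered exactly once."""
    target = frozenset(target)
    placements = candidate_placements(target)
    best = None

    def search(uncovered, chosen):
        nonlocal best
        # Bound: a partial solution at least as large as the incumbent cannot win.
        if best is not None and len(chosen) >= len(best):
            return
        if not uncovered:
            best = list(chosen)
            return
        cell = min(uncovered)  # branch on the first uncovered cell
        for name, cells in placements:
            # A placement is admissible only if it overlaps no covered cell.
            if cell in cells and cells <= uncovered:
                chosen.append((name, cells))
                search(uncovered - cells, chosen)
                chosen.pop()

    search(target, [])
    return best


if __name__ == "__main__":
    square = {(0, 0), (0, 1), (1, 0), (1, 1)}
    for name, cells in solve_cover(square):
        print(name, sorted(cells))
```

In a setting like the one described above, the returned list of placements would play the role of the plan that guides exploration of the lower-level policy. For realistic 3D targets, the same variables and constraints would be handed to a dedicated MILP solver instead of being enumerated.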