Object rearrangement aims to move objects from an initial state to a goal state. Here, we focus on a more practical setting: rearranging objects from shuffled layouts toward a normative target distribution without an explicit goal specification. This setting remains challenging for AI agents, since it is hard to describe the target distribution (goal specification) for reward engineering, or to collect expert trajectories as demonstrations. Consequently, reinforcement learning and imitation learning algorithms cannot be applied directly. This paper instead searches for a policy using only a set of examples from the target distribution, rather than a handcrafted reward function. We employ a score-matching objective to train a Target Gradient Field (TarGF), which indicates, for each object, a direction that increases the likelihood under the target distribution. For object rearrangement, the TarGF can be used in two ways: 1) for model-based planning, we cast the target gradient into a reference control and output actions with a distributed path planner; 2) for model-free reinforcement learning, the TarGF serves both to estimate the likelihood change as a reward and to provide suggested actions for residual policy learning. Experimental results on ball and room rearrangement demonstrate that our method significantly outperforms state-of-the-art methods in the quality of the terminal state, the efficiency of the control process, and scalability.
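The core idea above, that a gradient field over object states points toward higher likelihood under the target distribution, can be illustrated with a minimal numpy sketch. This is not the paper's method: the paper learns the score of an unknown layout distribution via score matching, whereas here we assume a known isotropic Gaussian target, whose score has the closed form `mu - x`, purely for illustration.

```python
import numpy as np

def gaussian_score(x, mu):
    """Score grad_x log p(x) of an isotropic Gaussian target N(mu, I):
    the direction of steepest increase in likelihood."""
    return mu - x

def log_likelihood(x, mu):
    """Unnormalized Gaussian log-density (constant terms dropped)."""
    return -0.5 * np.sum((x - mu) ** 2)

# "Objects" start in a shuffled layout; following the gradient field
# moves them toward high-likelihood configurations under the target.
rng = np.random.default_rng(0)
mu = np.zeros(2)                    # assumed target layout, for illustration
x = rng.normal(size=(4, 2)) * 5.0   # 4 objects in 2-D, shuffled

ll_before = sum(log_likelihood(xi, mu) for xi in x)
for _ in range(50):
    x = x + 0.1 * gaussian_score(x, mu)  # gradient-ascent step on log-likelihood
ll_after = sum(log_likelihood(xi, mu) for xi in x)

assert ll_after > ll_before  # likelihood increases along the gradient field
```

In the paper's setting the closed-form score is unavailable, so a network is trained with a score-matching objective on example layouts; the resulting TarGF plays the role of `gaussian_score` here, supplying per-object gradients for planning or for shaping rewards and residual actions.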