We train embodied neural networks to plan and navigate in unseen, complex 3D environments, with an emphasis on real-world deployment. Rather than requiring prior knowledge of the agent or environment, the planner learns to model the state transitions and rewards. To avoid the potentially hazardous trial-and-error of reinforcement learning, we focus on differentiable planners such as Value Iteration Networks (VIN), which are trained offline from safe expert demonstrations. Although these planners work well in small simulations, two major limitations hinder their deployment, and we address both. First, we observed that current differentiable planners struggle to plan over long horizons in environments with high branching complexity. Ideally they would learn to assign low rewards to obstacles to avoid collisions, but we posit that the constraints imposed on the network are not strong enough to guarantee that it learns sufficiently large penalties for every possible collision. We therefore impose a structural constraint on the value iteration that explicitly learns to model impossible actions. Second, we extend the model to work with a perspective camera that has a limited field of view and undergoes translation and rotation, which is crucial for real robot deployment. Many VIN-like planners assume a 360-degree or overhead view without rotation. In contrast, our method uses a memory-efficient lattice map to aggregate CNN embeddings of partial observations, and models the rotational dynamics explicitly using a 3D state-space grid (translation and rotation). Our proposals significantly improve semantic navigation and exploration in several 2D and 3D environments, succeeding in settings that are otherwise challenging for this class of methods. To our knowledge, we are the first to successfully perform differentiable planning on the difficult Active Vision Dataset, which consists of real images captured from a robot.
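To illustrate the two ideas above (masking impossible actions rather than relying on large collision penalties, and planning over a 3D grid of position plus heading), here is a minimal, non-differentiable sketch. It assumes a known occupancy grid, four discrete headings, three actions (forward, turn left, turn right) and a fixed per-step reward; the names `HEADINGS`, `ACTIONS`, `step`, `value_iteration` and `step_reward` are hypothetical. In the described method the network learns which actions are impossible from data, whereas here the mask is hand-coded from the grid.

```python
import math

# Hypothetical heading encoding as (dx, dy) unit steps: 0=E, 1=N, 2=W, 3=S.
HEADINGS = [(1, 0), (0, 1), (-1, 0), (0, -1)]
ACTIONS = ("forward", "left", "right")


def step(state, action):
    """Deterministic transition on the (x, y, heading) lattice."""
    x, y, h = state
    if action == "forward":
        dx, dy = HEADINGS[h]
        return (x + dx, y + dy, h)
    if action == "left":
        return (x, y, (h + 1) % 4)  # counter-clockwise turn
    return (x, y, (h - 1) % 4)      # clockwise turn


def value_iteration(grid, goal, iters=50, step_reward=-1.0):
    """Value iteration where impossible actions are masked, not penalised.

    grid[y][x] is True for free cells. Reaching the goal cell (x, y) in any
    heading is terminal with value 0; other states start at -inf.
    """
    height, width = len(grid), len(grid[0])
    # 3D state space: every free cell paired with every heading.
    states = [(x, y, h) for y in range(height) for x in range(width)
              for h in range(4) if grid[y][x]]
    V = {s: 0.0 if s[:2] == goal else -math.inf for s in states}
    for _ in range(iters):
        new_V = {}
        for s in states:
            if s[:2] == goal:
                new_V[s] = 0.0
                continue
            best = -math.inf
            for a in ACTIONS:
                nxt = step(s, a)
                # Structural constraint: moving into a wall or off the map
                # is impossible, so it is excluded from the max entirely
                # instead of receiving a (possibly too small) penalty.
                if nxt not in V:
                    continue
                best = max(best, step_reward + V[nxt])
            new_V[s] = best
        V = new_V
    return V


grid = [[True, True, True],
        [True, False, True],
        [True, True, True]]
V = value_iteration(grid, goal=(2, 2))
print(V[(0, 0, 0)])  # 4 forward moves plus 1 turn around the wall
```

Because masked actions never enter the maximum, no collision can look attractive regardless of how large the learned penalties are, and turning in place costs a step just like translating, so the values reflect the rotational dynamics.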