While deep reinforcement learning (RL) promises freedom from hand-labeled data, great successes, especially for Embodied AI, require significant work to create supervision via carefully shaped rewards. Indeed, without shaped rewards, i.e., with only terminal rewards, present-day Embodied AI results degrade significantly across Embodied AI problems from single-agent Habitat-based PointGoal Navigation (SPL drops from 55 to 0) and two-agent AI2-THOR-based Furniture Moving (success drops from 58% to 1%) to three-agent Google Football-based 3 vs. 1 with Keeper (game score drops from 0.6 to 0.1). As training from shaped rewards doesn't scale to more realistic tasks, the community needs to improve the success of training with terminal rewards. For this we propose GridToPix: 1) train agents with terminal rewards in gridworlds that generically mirror Embodied AI environments, i.e., they are independent of the task; 2) distill the learned policy into agents that reside in complex visual worlds. Despite learning from only terminal rewards with identical models and RL algorithms, GridToPix significantly improves results across tasks: from PointGoal Navigation (SPL improves from 0 to 64) and Furniture Moving (success improves from 1% to 25%) to football gameplay (game score improves from 0.1 to 0.6). GridToPix even helps to improve the results of shaped reward training.
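To make the two-stage recipe concrete, below is a minimal sketch of how GridToPix-style training could be wired up. It is an illustration under assumptions, not the paper's actual pipeline: REINFORCE stands in for the paper's RL algorithm, the network sizes are arbitrary, and `rollout_fn` / `paired_batch_fn` are hypothetical helpers that would supply gridworld episodes and paired (gridworld state, rendered image) observations.

```python
# Hypothetical sketch of the GridToPix recipe. Stage 1 trains a gridworld
# teacher from terminal rewards only (REINFORCE as a stand-in for the paper's
# RL algorithm); stage 2 distills the teacher's action distribution into a
# visual student that only sees images.
import torch
import torch.nn as nn
import torch.nn.functional as F

N_ACTIONS = 4  # placeholder discrete action space

class GridPolicy(nn.Module):
    """Teacher: acts on a low-dimensional gridworld state."""
    def __init__(self, state_dim=16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                                 nn.Linear(64, N_ACTIONS))
    def forward(self, state):
        return self.net(state)  # action logits

class PixelPolicy(nn.Module):
    """Student: acts on raw RGB observations from the visual environment."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(3, 16, 8, stride=4), nn.ReLU(),
                                 nn.Conv2d(16, 32, 4, stride=2), nn.ReLU(),
                                 nn.Flatten(), nn.LazyLinear(N_ACTIONS))
    def forward(self, image):
        return self.net(image)  # action logits

def train_teacher_terminal_reward(teacher, rollout_fn, steps=1000, lr=1e-3):
    """Stage 1: policy-gradient training in the gridworld where only the
    terminal reward (success/failure) is non-zero. `rollout_fn` is an assumed
    helper returning (per-step log-probs, terminal reward) for one episode."""
    opt = torch.optim.Adam(teacher.parameters(), lr=lr)
    for _ in range(steps):
        log_probs, terminal_reward = rollout_fn(teacher)
        loss = -(terminal_reward * log_probs.sum())  # credit the whole episode
        opt.zero_grad(); loss.backward(); opt.step()

def distill_to_pixels(teacher, student, paired_batch_fn, steps=1000, lr=1e-3):
    """Stage 2: supervise the visual student with the frozen teacher's action
    distribution on paired gridworld/visual observations."""
    opt = torch.optim.Adam(student.parameters(), lr=lr)
    for _ in range(steps):
        grid_states, images = paired_batch_fn()  # assumed paired sampler
        with torch.no_grad():
            target = F.softmax(teacher(grid_states), dim=-1)
        loss = F.kl_div(F.log_softmax(student(images), dim=-1),
                        target, reduction="batchmean")
        opt.zero_grad(); loss.backward(); opt.step()
```

The key design point this sketch mirrors is that the teacher never needs shaped rewards or pixels: the gridworld abstracts the environment generically (task-independent layout and dynamics), so terminal-reward RL remains tractable there, and the visual agent inherits the behavior through distillation rather than through reward engineering.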