Embodied Concept Learner: Self-supervised Learning of Concepts and Mapping through Instruction Following

Humans, even at a very early age, can learn visual concepts and understand geometry and layout through active interaction with the environment, and can generalize compositions of these concepts to complete tasks described in natural language in novel scenes. To mimic this capability, we propose the Embodied Concept Learner (ECL) in an interactive 3D environment. Specifically, a robot agent can ground visual concepts, build semantic maps, and plan actions to complete tasks by learning purely from human demonstrations and language instructions, without access to ground-truth semantic or depth supervision from simulation. ECL consists of: (i) an instruction parser that translates natural language into executable programs; (ii) an embodied concept learner that grounds visual concepts based on language descriptions; (iii) a map constructor that estimates depth and constructs semantic maps by leveraging the learned concepts; and (iv) a program executor with deterministic policies to execute each program. ECL has several appealing benefits thanks to its modular design. First, it enables the robotic agent to learn semantics and depth without supervision, much as infants do, e.g., grounding concepts through active interaction and perceiving depth from disparities when moving forward. Second, ECL is fully transparent and step-by-step interpretable in long-term planning. Third, ECL can benefit embodied instruction following (EIF), outperforming previous work on the ALFRED benchmark when semantic labels are not provided. In addition, the learned concepts can be reused for other downstream tasks, such as reasoning about object states. Project page: http://ecl.csail.mit.edu/
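To make the modular flow concrete, the following is a minimal toy sketch of how an instruction parser could hand a program to a deterministic executor that consults a semantic map. It is not the ECL implementation: the names `Step`, `parse_instruction`, and `execute`, the primitive names, and the dictionary-based map are all hypothetical stand-ins chosen for illustration, and the learned components (concept grounding, depth estimation) are replaced by a hand-written lookup table.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Step:
    op: str   # hypothetical primitive name, e.g. "goto" or "pickup"
    arg: str  # object concept the primitive refers to


# Hypothetical phrase-to-primitive table; a real parser would be learned.
VERBS = {"go to": "goto", "pick up": "pickup", "put": "put"}


def parse_instruction(text: str) -> List[Step]:
    """Toy stand-in for an instruction parser: one step per comma-separated clause."""
    steps = []
    for clause in text.lower().split(","):
        clause = clause.strip()
        for phrase, op in VERBS.items():
            if clause.startswith(phrase):
                arg = clause[len(phrase):].strip()
                if arg.startswith("the "):
                    arg = arg[4:]  # drop the article so the concept name is bare
                steps.append(Step(op, arg))
                break
    return steps


def execute(program: List[Step], semantic_map: Dict[str, Tuple[int, int]]) -> List[str]:
    """Toy deterministic executor: one fixed rule per step, logged for inspection."""
    log = []
    for step in program:
        if step.arg not in semantic_map:
            # Concept not yet on the map: fall back to exploration.
            log.append(f"explore for {step.arg}")
        else:
            log.append(f"{step.op} {step.arg} at {semantic_map[step.arg]}")
    return log


if __name__ == "__main__":
    program = parse_instruction("Go to the mug, pick up the mug, go to the sink")
    for line in execute(program, {"mug": (2, 3)}):
        print(line)
```

Because each step is logged as it is dispatched, the trace can be read line by line, which is the kind of step-by-step interpretability the modular design is meant to provide.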
