We study the task of semantic mapping: specifically, an embodied agent (a robot or an egocentric AI assistant) is given a tour of a new environment and asked to build an allocentric top-down semantic map ("what is where?") from egocentric observations of an RGB-D camera with known pose (via localization sensors). Towards this goal, we present SemanticMapNet (SMNet), which consists of: (1) an Egocentric Visual Encoder that encodes each egocentric RGB-D frame, (2) a Feature Projector that projects egocentric features to appropriate locations on a floor-plan, (3) a Spatial Memory Tensor of size floor-plan length x width x feature-dims that learns to accumulate projected egocentric features, and (4) a Map Decoder that uses the memory tensor to produce semantic top-down maps. SMNet combines the strengths of (known) projective camera geometry and neural representation learning. On the task of semantic mapping in the Matterport3D dataset, SMNet significantly outperforms competitive baselines by 4.01-16.81% (absolute) on mean-IoU and 3.81-19.69% (absolute) on Boundary-F1 metrics. Moreover, we show how to use the neural episodic memories and spatio-semantic allocentric representations built by SMNet for subsequent tasks in the same space: navigating to objects seen during the tour ("Find chair") or answering questions about the space ("How many chairs did you see in the house?"). Project page: https://vincentcartillier.github.io/smnet.html.
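For concreteness, below is a minimal PyTorch sketch of the four-stage pipeline the abstract describes. It is an illustration under stated assumptions, not the authors' implementation: the small convolutional encoder, the toy pinhole projection with an (x, z, yaw) pose convention, the per-cell GRU memory update, and all names (SMNetSketch, cell_size, hfov, etc.) are ours.

```python
import math
import torch
import torch.nn as nn

class SMNetSketch(nn.Module):
    """Toy four-stage pipeline mirroring the abstract's description."""

    def __init__(self, feat_dim=64, num_classes=13, map_h=250, map_w=250):
        super().__init__()
        self.feat_dim, self.map_h, self.map_w = feat_dim, map_h, map_w
        # (1) Egocentric Visual Encoder: per-frame features from 4-channel RGB-D.
        self.encoder = nn.Sequential(
            nn.Conv2d(4, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, feat_dim, 3, padding=1), nn.ReLU(),
        )
        # (3) Spatial Memory Tensor update: a GRU cell applied per map cell.
        self.gru = nn.GRUCell(feat_dim, feat_dim)
        # (4) Map Decoder: memory tensor -> per-cell semantic logits.
        self.decoder = nn.Conv2d(feat_dim, num_classes, 1)

    def project(self, feats, depth, pose, hfov=1.57, cell_size=0.05):
        # (2) Feature Projector: unproject pixels with depth and a toy pinhole
        # model, transform by the known pose (x, z, yaw), bin into map cells.
        B, C, H, W = feats.shape
        fx = (W / 2) / math.tan(hfov / 2)
        u = torch.arange(W, dtype=torch.float32) - W / 2
        xs = (u / fx).view(1, 1, W) * depth           # camera-frame lateral
        zs = depth                                    # camera-frame forward
        x0, z0, yaw = pose[:, 0:1], pose[:, 1:2], pose[:, 2:3]
        c, s = torch.cos(yaw), torch.sin(yaw)
        wx = c * xs.flatten(1) - s * zs.flatten(1) + x0   # world x
        wz = s * xs.flatten(1) + c * zs.flatten(1) + z0   # world z
        col = (wx / cell_size).long().clamp(0, self.map_w - 1)
        row = (wz / cell_size).long().clamp(0, self.map_h - 1)
        idx = row * self.map_w + col                  # B x (H*W) flat indices
        obs = feats.flatten(2).permute(0, 2, 1)       # B x (H*W) x C
        return idx, obs

    def forward(self, rgbd_frames, depths, poses):
        """rgbd_frames: list of Bx4xHxW; depths: list of BxHxW; poses: list of Bx3."""
        B = rgbd_frames[0].shape[0]
        memory = torch.zeros(B, self.map_h * self.map_w, self.feat_dim)
        for rgbd, depth, pose in zip(rgbd_frames, depths, poses):
            feats = self.encoder(rgbd)                # egocentric features
            idx, obs = self.project(feats, depth, pose)
            for b in range(B):                        # naive scatter + update
                memory[b, idx[b]] = self.gru(obs[b], memory[b, idx[b]])
        mem = memory.view(B, self.map_h, self.map_w, -1).permute(0, 3, 1, 2)
        return self.decoder(mem)                      # B x classes x map_h x map_w
```

A short usage example with made-up inputs (a three-frame "tour" at 64x64 resolution):

```python
model = SMNetSketch()
frames = [torch.rand(1, 4, 64, 64) for _ in range(3)]            # RGB-D frames
depths = [2.0 + torch.rand(1, 64, 64) for _ in range(3)]         # depth in meters
poses = [torch.tensor([[3.0, 3.0, 0.1 * t]]) for t in range(3)]  # (x, z, yaw)
logits = model(frames, depths, poses)   # -> torch.Size([1, 13, 250, 250])
```

The key design point the sketch preserves is the division of labor the abstract highlights: the projection step is pure (known) camera geometry with no learned parameters, while the encoder, the recurrent memory update, and the decoder are learned.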