Active vision is inherently attention-driven: the agent actively selects views to attend to, seeking to accomplish the vision task quickly while improving its internal representation of the observed scene. Inspired by the recent success of attention-based models on 2D vision tasks over single RGB images, we propose to address multi-view, depth-based active object recognition with an attention mechanism, by developing an end-to-end recurrent 3D attentional network. The architecture exploits a recurrent neural network (RNN) to store and update an internal representation. Trained on 3D shape datasets, our model iteratively attends to the best views of a target object in order to recognize it. To realize 3D view selection, we derive a 3D spatial transformer network that is differentiable and can therefore be trained with backpropagation, achieving much faster convergence than the reinforcement learning employed by most existing attention-based models. Experiments show that our method, with only depth input, achieves state-of-the-art next-best-view performance in both time efficiency and recognition accuracy.
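To make the recurrent attention loop concrete, the following is a minimal NumPy sketch of the idea, not the paper's actual architecture: a hidden state is updated from the currently attended view's depth feature, and linear heads score both candidate next views (a differentiable soft selection, standing in for the 3D spatial transformer) and object classes. All dimensions, weight matrices, and the simple tanh recurrence are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

FEAT, HID, N_VIEWS, N_CLASSES = 16, 32, 8, 5

# Hypothetical, randomly initialized parameters (for illustration only).
W_in   = rng.normal(scale=0.1, size=(HID, FEAT))      # view feature -> hidden
W_rec  = rng.normal(scale=0.1, size=(HID, HID))       # hidden -> hidden
W_view = rng.normal(scale=0.1, size=(N_VIEWS, HID))   # hidden -> view scores
W_cls  = rng.normal(scale=0.1, size=(N_CLASSES, HID)) # hidden -> class scores

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def step(h, view_feat):
    """One recurrence: fuse the attended view's depth feature into the
    hidden state, then score candidate next views and object classes."""
    h_new = np.tanh(W_in @ view_feat + W_rec @ h)
    view_probs = softmax(W_view @ h_new)    # soft (differentiable) view selection
    class_probs = softmax(W_cls @ h_new)    # current recognition belief
    return h_new, view_probs, class_probs

# Simulate a few glimpses over random per-view depth features.
view_feats = rng.normal(size=(N_VIEWS, FEAT))
h = np.zeros(HID)
view = 0
for _ in range(3):
    h, view_probs, class_probs = step(h, view_feats[view])
    view = int(view_probs.argmax())         # hard next-best view at test time
```

During training, gradients would flow through the soft `view_probs` distribution (rather than the hard argmax), which is what lets backpropagation replace reinforcement learning for view selection.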