Learning to control a robot commonly requires a mapping between robot states and camera images, for which conventional deep vision models require large training datasets. Existing visual attention models, such as Deep Spatial Autoencoders, have improved data efficiency by training the model to selectively extract only the task-relevant image regions. However, since these models cannot select attention targets on demand, the diversity of trainable tasks is limited. This paper proposes a novel Key-Query-Value formulated visual attention model that can be guided toward a specified attention target. The model creates an attention heatmap from the Key and Query, and selectively extracts the attended data represented in the Value. This structure can incorporate external inputs to create the Query, which is trained to represent the target objects. Separating the Query creation improves the model's flexibility, enabling it to simultaneously attend to, and switch between, multiple targets in a top-down manner. The proposed model was evaluated in both a simulator and a real-world environment, showing better performance than existing end-to-end robot vision models. The results of the real-world experiments indicate the model's high scalability and extensibility on robot control tasks.
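To make the Key-Query-Value formulation concrete, below is a minimal sketch of how an attention heatmap can be computed from a Query derived from an external target signal and Keys derived from image features, and then used to read out the attended Values. All module names, dimensions, and the specific normalization are illustrative assumptions, not the paper's exact architecture.

```python
# Minimal sketch of a Key-Query-Value attention readout over a feature map.
# Module names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class KQVAttentionReadout(nn.Module):
    def __init__(self, feat_ch=32, emb_dim=16, target_dim=8):
        super().__init__()
        # Key/Value are computed from the image feature map; the Query is
        # produced from an external target signal (top-down guidance).
        self.key = nn.Conv2d(feat_ch, emb_dim, kernel_size=1)
        self.value = nn.Conv2d(feat_ch, emb_dim, kernel_size=1)
        self.query = nn.Linear(target_dim, emb_dim)

    def forward(self, feat, target):
        # feat: (B, C, H, W) image features; target: (B, target_dim) target code
        B, _, H, W = feat.shape
        k = self.key(feat).flatten(2)        # (B, E, H*W)
        v = self.value(feat).flatten(2)      # (B, E, H*W)
        q = self.query(target).unsqueeze(1)  # (B, 1, E)
        # Attention heatmap over spatial locations from Query-Key similarity.
        heatmap = F.softmax(torch.bmm(q, k) / k.shape[1] ** 0.5, dim=-1)  # (B, 1, H*W)
        # Attended readout: heatmap-weighted sum of the Value features.
        attended = torch.bmm(v, heatmap.transpose(1, 2)).squeeze(-1)      # (B, E)
        return attended, heatmap.view(B, 1, H, W)

# Example: attend to one feature map under an external target code.
feat = torch.randn(1, 32, 16, 16)
target = torch.randn(1, 8)
attended, heatmap = KQVAttentionReadout()(feat, target)
print(attended.shape, heatmap.shape)  # (1, 16) and (1, 1, 16, 16)
```

Because the Query is produced by a separate module from an external target code, switching the target code at inference time moves the heatmap to a different object without retraining the Key/Value pathway, which is the top-down guidance the abstract describes.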