Deep robot vision models are widely used for recognizing objects from camera images, but they perform poorly when detecting objects at positions not seen during training. Although this problem can be alleviated by training with large datasets, the cost of collecting such datasets cannot be ignored. Existing visual attention models tackle the problem by employing a data-efficient structure that learns to extract task-relevant image areas. However, since these models cannot modify their attention targets after training, they are difficult to apply to dynamically changing tasks. This paper proposes a novel visual attention model based on a Key-Query-Value formulation. The model can switch attention targets when the Query representations are modified externally, a mechanism referred to as top-down attention. The proposed model is evaluated in a simulator and in a real-world environment. In the simulator experiments, the model was compared with existing end-to-end robot vision models and showed higher performance and data efficiency. In the real-world robot experiments, the model achieved high precision and demonstrated scalability and extensibility.
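The query-switching idea can be illustrated with a minimal sketch. The abstract does not describe the model's architecture, so everything below is an assumption made for illustration: the toy 4x4 feature grid, the hand-written two-dimensional Keys, the use of pixel coordinates as Values, the scaled dot-product scoring, and the names `softmax` and `attend`. It shows only that, with Keys and Values held fixed, changing the Query alone moves the attended position.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(keys, values, query):
    """Scaled dot-product attention for a single query (hypothetical helper).

    keys:   one key vector per image location
    values: one value vector per image location (here: its coordinates)
    query:  a single query vector that selects the attention target
    Returns the attention-weighted average of the values and the weights.
    """
    scale = math.sqrt(len(query))
    scores = [sum(k * q for k, q in zip(key, query)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return output, weights


# Toy 4x4 grid (assumed): Values are the (row, col) coordinates of each cell.
H, W = 4, 4
positions = [(float(y), float(x)) for y in range(H) for x in range(W)]

# Hand-written Keys (assumed): two objects with distinct feature directions.
keys = [[0.0, 0.0] for _ in positions]
keys[0 * W + 1] = [10.0, 0.0]  # object A at row 0, col 1
keys[3 * W + 2] = [0.0, 10.0]  # object B at row 3, col 2

# Switching the externally supplied Query switches the attention target.
query_a = [1.0, 0.0]
query_b = [0.0, 1.0]
point_a, weights_a = attend(keys, positions, query_a)
point_b, weights_b = attend(keys, positions, query_b)

print("attended position for query A:", [round(c, 2) for c in point_a])
print("attended position for query B:", [round(c, 2) for c in point_b])
```

In this sketch the attended position lands close to object A for the first Query and close to object B for the second, without any change to the Keys or Values. In a learned model the Keys would come from image features rather than being written by hand.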