In this work, we consider the problem of learning a perception model for monocular robot navigation using few annotated images. Using a Vision Transformer (ViT) pretrained with a label-free self-supervised method, we successfully train a coarse image segmentation model for the Duckietown environment using 70 training images. Our model performs coarse image segmentation at the 8x8 patch level, and the inference resolution can be adjusted to balance prediction granularity and real-time perception constraints. We study how best to adapt a ViT to our task and environment, and find that some lightweight architectures can yield good single-image segmentation at a usable frame rate, even on CPU. The resulting perception model is used as the backbone for a simple yet robust visual servoing agent, which we deploy on a differential drive mobile robot to perform two tasks: lane following and obstacle avoidance.
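The patch-level segmentation scheme described above can be illustrated with a short sketch. This is a minimal, hypothetical example, not the paper's implementation: it assumes a DINO-pretrained ViT-S/8 backbone (one self-supervised, label-free pretraining option, loaded via torch.hub) with a frozen feature extractor and a linear per-patch classification head; the class count and the adaptation strategy are placeholders.

```python
import torch
import torch.nn as nn

# Hypothetical class set for the Duckietown environment; the actual
# label taxonomy used in the paper is an assumption here.
NUM_CLASSES = 7
PATCH = 8  # ViT-S/8 yields one prediction per 8x8 image patch

# Assumed backbone: DINO-pretrained ViT-S/8 from the official torch.hub entry.
backbone = torch.hub.load("facebookresearch/dino:main", "dino_vits8")
backbone.eval()
for p in backbone.parameters():
    p.requires_grad = False  # freeze the pretrained features

# Lightweight per-patch classifier, one option for "adapting" the ViT.
head = nn.Linear(backbone.embed_dim, NUM_CLASSES)

def segment(img: torch.Tensor) -> torch.Tensor:
    """img: [B, 3, H, W] with H, W multiples of 8.
    Returns [B, H//8, W//8] coarse class-id maps."""
    B, _, H, W = img.shape
    with torch.no_grad():
        # List of token tensors from the last block(s); each is [B, 1+N, D].
        tokens = backbone.get_intermediate_layers(img, n=1)[0]
    patch_tokens = tokens[:, 1:]          # drop the [CLS] token
    logits = head(patch_tokens)           # [B, N, NUM_CLASSES]
    return logits.argmax(-1).view(B, H // PATCH, W // PATCH)

# The inference resolution can be lowered to trade prediction granularity
# for speed: e.g. a 240x320 input gives a 30x40 patch grid instead of the
# 60x80 grid a 480x640 input would give.
coarse = segment(torch.randn(1, 3, 240, 320))
print(coarse.shape)  # torch.Size([1, 30, 40])
```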