Classical Visual Servoing (VS) approaches rely on handcrafted visual features, which limit their generalizability. Recently, a number of approaches, some based on Deep Neural Networks, have been proposed to overcome this limitation by directly comparing the entire target and current camera images. However, by getting rid of visual features altogether, those approaches require the target and current images to be essentially similar, which precludes generalization to unknown, cluttered scenes. Here we propose to perform VS based on visual features, as in classical VS approaches, but, unlike the latter, we leverage recent breakthroughs in Deep Learning to automatically extract and match the visual features. By doing so, our approach enjoys the advantages of both worlds: (i) because it is based on visual features, it is able to steer the robot towards the object of interest even in the presence of significant distraction in the background; (ii) because the features are automatically extracted and matched, it can easily and automatically generalize to unseen objects and scenes. In addition, we propose to use a render engine to synthesize the target image, which offers a further level of generalization. We demonstrate these advantages in a robotic grasping task, where the robot steers, with high accuracy, towards the object to grasp, based simply on an image of the object rendered from the camera view corresponding to the desired robot grasping pose.
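
To make the pipeline concrete, the sketch below shows a classical image-based visual servoing (IBVS) control step driven by point correspondences, using the standard interaction matrix for normalized image points. The `match_features` and `estimate_depths` calls are hypothetical placeholders for whatever learned extractor/matcher and depth source are used; this is a minimal illustration of the control law, not the paper's implementation.

```python
import numpy as np

def interaction_matrix(x, y, Z):
    """Interaction (image Jacobian) matrix of a normalized image point (x, y)
    at depth Z, relating its image motion to the 6-DoF camera velocity."""
    return np.array([
        [-1.0 / Z, 0.0, x / Z, x * y, -(1.0 + x * x), y],
        [0.0, -1.0 / Z, y / Z, 1.0 + y * y, -x * y, -x],
    ])

def ibvs_velocity(current_pts, target_pts, depths, lam=0.5):
    """Classical IBVS control law: v = -lambda * L^+ * (s - s*),
    where s are the current features and s* the target (desired) features."""
    error = (current_pts - target_pts).reshape(-1)            # feature error s - s*
    L = np.vstack([interaction_matrix(x, y, Z)
                   for (x, y), Z in zip(current_pts, depths)])
    return -lam * np.linalg.pinv(L) @ error                   # 6-DoF camera velocity command

# Hypothetical usage: matched normalized points would come from a learned
# extractor/matcher (placeholder names, not an actual API) applied to the
# current camera image and the rendered target image.
# current_pts, target_pts = match_features(current_image, rendered_target_image)
# depths = estimate_depths(current_pts)   # e.g. from the render or a rough prior
# v = ibvs_velocity(current_pts, target_pts, depths)
```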