The gap between simulation and the real world keeps many machine learning breakthroughs in computer vision and reinforcement learning from being applied in practice. In this work, we tackle this gap for the specific case of camera-based navigation, which we formulate as following a visual cue in the foreground against arbitrary backgrounds. The foreground cue, such as a line, gate, or cone, can often be simulated realistically. The challenge then lies in coping with unknown backgrounds and in integrating the two. The goal is therefore to train a visual agent on data captured in a simulated environment that is empty except for the foreground cue, and to test this model directly in a visually diverse real world. To bridge this large gap, we show that it is crucial to combine the following techniques: randomized augmentation of the foreground and background, regularization with both deep supervision and a triplet loss, and abstraction of the dynamics by using waypoints rather than direct velocity commands. We ablate these techniques both qualitatively and quantitatively in our experiments, and finally demonstrate a successful transfer from simulation to the real world.
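As a rough illustration of two of the ingredients named above, the following sketch composites a simulated foreground cue over a randomized background and evaluates a triplet margin loss on embedding vectors. It is a minimal sketch under stated assumptions, not the implementation used in this work: the function names, the uniform-noise background, the squared Euclidean distance, and the margin value are all hypothetical choices made for illustration.

```python
import numpy as np


def composite(foreground, mask, background):
    """Blend a foreground cue over a background.

    foreground, background: float arrays of shape (H, W, C) in [0, 1].
    mask: float array of shape (H, W) in [0, 1]; 1 marks foreground pixels.
    """
    alpha = mask[..., None]  # add a channel axis so the mask broadcasts
    return alpha * foreground + (1.0 - alpha) * background


def random_background(shape, rng):
    """Hypothetical stand-in for a real background image: uniform noise."""
    return rng.uniform(0.0, 1.0, size=shape)


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Triplet margin loss with squared Euclidean distances (an assumption)."""
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return float(max(0.0, d_pos - d_neg + margin))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    height, width, channels = 8, 8, 3

    # Simulated cue: a vertical white line in an otherwise empty scene.
    foreground = np.ones((height, width, channels))
    mask = np.zeros((height, width))
    mask[:, width // 2] = 1.0

    # The same cue rendered over two different random backgrounds.
    view_a = composite(foreground, mask, random_background(foreground.shape, rng))
    view_b = composite(foreground, mask, random_background(foreground.shape, rng))
    print(view_a.shape, view_b.shape)
```

One plausible way to form triplets, which the abstract does not confirm, is to treat two views of the same cue over different backgrounds as anchor and positive, and a view of a different cue as the negative, so that the learned embedding becomes insensitive to the background.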