Depth estimation is an important computer vision task, useful in particular for navigation in autonomous vehicles, or for object manipulation in robotics. Here we solved it using an end-to-end neuromorphic approach, combining two event-based cameras and a Spiking Neural Network (SNN) with a slightly modified U-Net-like encoder-decoder architecture, which we named StereoSpike. More specifically, we used the Multi Vehicle Stereo Event Camera Dataset (MVSEC). It provides a depth ground truth, which was used to train StereoSpike in a supervised manner, using surrogate gradient descent. We propose a novel readout paradigm to obtain a dense analog prediction -- the depth of each pixel -- from the spikes of the decoder. We demonstrate that this architecture generalizes very well, even better than its non-spiking counterparts, leading to state-of-the-art test accuracy. To the best of our knowledge, it is the first time that such a large-scale regression problem has been solved by a fully spiking network. Finally, we show that low firing rates (<10%) can be obtained via regularization, at a minimal cost in accuracy. This means that StereoSpike could be efficiently implemented on neuromorphic chips, opening the door to low-power and real-time embedded systems.
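The readout paradigm mentioned above can be illustrated with a minimal NumPy sketch: a non-firing output layer accumulates weighted spikes from the last decoder layer, and its membrane potential is read directly as the dense analog prediction. This is a hypothetical toy illustration, not the authors' implementation; the array shapes, the random binary spike maps, and the per-channel readout weights are all assumptions made for the example.

```python
import numpy as np

# Hypothetical binary spike maps from the last decoder layer,
# shaped (channels, height, width); ~10% firing rate, echoing the
# sparsity level the abstract reports after regularization.
rng = np.random.default_rng(0)
decoder_spikes = (rng.random((8, 4, 4)) < 0.1).astype(np.float32)

# Readout idea: output units never fire; their membrane potential is
# the prediction. Here this is a per-pixel weighted sum over channels
# (the equivalent of a 1x1 convolution), with made-up weights.
readout_weights = rng.standard_normal(8).astype(np.float32)

# Membrane potential of the output units = weighted sum of input spikes,
# yielding one analog depth value per pixel.
depth_map = np.tensordot(readout_weights, decoder_spikes, axes=([0], [0]))

print(depth_map.shape)  # one analog value per pixel
```

Reading the potential instead of thresholding it is what turns a purely spiking decoder into a dense regression output while keeping every upstream layer binary and sparse.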