Existing state-of-the-art disparity estimation works mostly leverage a 4D concatenation volume and construct a very deep 3D convolutional neural network (CNN) for disparity regression, which is inefficient due to high memory consumption and slow inference. In this paper, we propose a network named EDNet for efficient disparity estimation. First, we construct a combined volume that incorporates contextual information from a squeezed concatenation volume and feature-similarity measurements from a correlation volume. The combined volume can then be aggregated by 2D convolutions, which are faster and require less memory than 3D convolutions. Second, we propose an attention-based spatial residual module to generate attention-aware residual features. The attention mechanism provides intuitive spatial evidence about inaccurate regions with the help of error maps at multiple scales, and thus improves residual-learning efficiency. Extensive experiments on the Scene Flow and KITTI datasets show that EDNet outperforms previous 3D-CNN-based works and achieves state-of-the-art performance with significantly faster speed and lower memory consumption.
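The core of the combined-volume idea can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the correlation volume here is a plain per-disparity mean dot product between left and right features, and the "squeeze" of the concatenation volume is stood in for by simple channel-group averaging (in the actual network this would be a learned convolution); the function names and the `squeeze_channels` parameter are hypothetical.

```python
import numpy as np

def correlation_volume(feat_l, feat_r, max_disp):
    """Correlation cost volume: per-disparity mean dot product of features.
    feat_*: (C, H, W) feature maps; returns (max_disp, H, W)."""
    C, H, W = feat_l.shape
    vol = np.zeros((max_disp, H, W), dtype=feat_l.dtype)
    for d in range(max_disp):
        # correlate left pixel at x with right pixel at x - d
        vol[d, :, d:] = (feat_l[:, :, d:] * feat_r[:, :, :W - d]).mean(axis=0)
    return vol

def combined_volume(feat_l, feat_r, max_disp, squeeze_channels):
    """Combined volume: correlation volume stacked with a channel-'squeezed'
    concatenation feature. Channel-group averaging is a stand-in for the
    learned squeeze; the result is a 3D tensor a 2D CNN can aggregate."""
    corr = correlation_volume(feat_l, feat_r, max_disp)      # (D, H, W)
    concat = np.concatenate([feat_l, feat_r], axis=0)        # (2C, H, W)
    groups = np.array_split(concat, squeeze_channels, axis=0)
    squeezed = np.stack([g.mean(axis=0) for g in groups])    # (S, H, W)
    return np.concatenate([corr, squeezed], axis=0)          # (D+S, H, W)
```

The key point the sketch shows is dimensionality: a 4D concatenation volume (channels × disparity × H × W) forces 3D convolutions, whereas the combined volume is a single (D+S) × H × W tensor, so all subsequent aggregation can use cheaper 2D convolutions.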