Monocular Depth Estimation (MDE) aims to predict pixel-wise depth from a single RGB image. For both convolutional and recent attention-based models, encoder-decoder architectures have proven useful because the task requires global context and pixel-level resolution simultaneously. Typically, a skip connection module fuses the encoder and decoder features; it comprises feature map concatenation followed by a convolution operation. Inspired by the demonstrated benefits of attention in a multitude of computer vision problems, we propose an attention-based fusion of encoder and decoder features. We pose MDE as a pixel query refinement problem, where coarsest-level encoder features are used to initialize pixel-level queries, which are then refined to higher resolutions by the proposed Skip Attention Module (SAM). We formulate the prediction problem as ordinal regression over the bin centers that discretize the continuous depth range, and introduce a Bin Center Predictor (BCP) module that predicts bins at the coarsest level using pixel queries. Apart from the benefit of image-adaptive depth binning, the proposed design helps learn an improved depth embedding in the initial pixel queries via direct supervision from the ground truth. Extensive experiments on the two canonical datasets, NYUV2 and KITTI, show that our architecture outperforms the state-of-the-art by 5.3% and 3.9%, respectively, and improves generalization performance by 9.4% on the SUNRGBD dataset. Code is available at https://github.com/ashutosh1807/PixelFormer.git.
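To make the bin-based formulation concrete, the following is a minimal sketch of how image-adaptive bin centers could be turned into a per-pixel depth value. It is not the PixelFormer implementation: the function names (`bin_centers`, `pixel_depth`), the softmax normalization of bin widths, and the probability-weighted sum over centers are assumptions borrowed from common adaptive-binning practice, and the logits are placeholder inputs standing in for network outputs.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def bin_centers(width_logits, d_min, d_max):
    """Hypothetical helper: map per-image width logits to bin centers.

    Assumption: widths are softmax-normalized so that they partition
    the depth range [d_min, d_max]; each center is its bin's midpoint.
    """
    widths = softmax(width_logits)
    centers = []
    left = d_min
    for w in widths:
        span = w * (d_max - d_min)
        centers.append(left + span / 2)
        left += span
    return centers


def pixel_depth(bin_logits, centers):
    """Hypothetical helper: depth as the probability-weighted sum of centers.

    Assumption: each pixel carries one logit per bin.
    """
    probs = softmax(bin_logits)
    return sum(p * c for p, c in zip(probs, centers))


# Four equal-width bins over an assumed depth range of 0 to 8 metres.
centers = bin_centers([0.0, 0.0, 0.0, 0.0], d_min=0.0, d_max=8.0)
# A pixel with no preference among bins lands at the middle of the range.
depth = pixel_depth([0.0, 0.0, 0.0, 0.0], centers)
```

Because the widths depend on the input image, the same number of bins can concentrate resolution where a given scene's depths actually fall; in this sketch that corresponds to passing different `width_logits` per image.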