Single Image Depth Prediction Made Better: A Multivariate Gaussian Take

Neural-network-based single image depth prediction (SIDP) is a challenging task in which the goal is to predict a scene's per-pixel depth at test time. Because the problem is ill-posed by definition, the fundamental goal is to devise an approach that can reliably model scene depth from a set of training examples. In the pursuit of perfect depth estimation, most existing state-of-the-art learning techniques predict a single scalar depth value per pixel. Yet it is well known that a trained model has accuracy limits and can predict imprecise depth. An SIDP approach must therefore be mindful of the expected depth variations in the model's prediction at test time. Accordingly, we introduce an approach that performs continuous modeling of per-pixel depth, so that we can predict and reason about both the per-pixel depth and its distribution. To this end, we model per-pixel scene depth using a multivariate Gaussian distribution. Moreover, in contrast to existing uncertainty modeling methods in the same spirit, which assume that per-pixel depth is independent, we introduce per-pixel covariance modeling that encodes each pixel's depth dependency with respect to all the scene points. Unfortunately, per-pixel depth covariance modeling leads to a computationally expensive continuous loss function, which we compute efficiently using a learned low-rank approximation of the overall covariance matrix. Notably, when tested on benchmark datasets such as KITTI, NYU, and SUN-RGB-D, the SIDP model obtained by optimizing our loss function shows state-of-the-art results. Our method (named MG) is among the most accurate on the KITTI depth-prediction benchmark leaderboard.
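
To illustrate why a low-rank covariance makes a multivariate Gaussian loss tractable, the following is a minimal sketch, not the authors' implementation. It assumes the covariance takes the form Sigma = Psi Psi^T + sigma2 * I, with Psi an N x M matrix and M much smaller than the number of pixels N; the abstract does not specify this form, and the names `low_rank_gaussian_nll`, `psi`, and `sigma2` are hypothetical. Under that assumption, the Woodbury identity and the matrix determinant lemma reduce the negative log-likelihood to operations on an M x M matrix, so the N x N covariance is never built.

```python
import numpy as np


def low_rank_gaussian_nll(depth_gt, mean, psi, sigma2):
    """Negative log-likelihood of depth_gt under N(mean, psi @ psi.T + sigma2 * I).

    depth_gt, mean: arrays of shape (N,), flattened per-pixel depths.
    psi:            array of shape (N, M), assumed low-rank factor (M << N).
    sigma2:         positive scalar, assumed isotropic variance term.
    """
    n, m = psi.shape
    residual = depth_gt - mean

    # Small M x M matrix that replaces the full N x N covariance.
    cap = sigma2 * np.eye(m) + psi.T @ psi
    chol = np.linalg.cholesky(cap)  # cap = chol @ chol.T

    # Woodbury identity:
    #   r^T Sigma^{-1} r = (r^T r - (Psi^T r)^T cap^{-1} (Psi^T r)) / sigma2
    z = np.linalg.solve(chol, psi.T @ residual)
    mahalanobis = (residual @ residual - z @ z) / sigma2

    # Matrix determinant lemma:
    #   log det(Sigma) = (N - M) * log(sigma2) + log det(cap)
    log_det = (n - m) * np.log(sigma2) + 2.0 * np.sum(np.log(np.diag(chol)))

    return 0.5 * (mahalanobis + log_det + n * np.log(2.0 * np.pi))
```

In this sketch the cost is on the order of N * M^2 rather than the N^3 of a dense solve, which is the kind of saving a low-rank approximation is meant to provide; in a trained model, `mean`, `psi`, and `sigma2` would be network outputs rather than fixed arrays.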
