Stereo correspondence matching is an essential part of the multi-step stereo depth estimation process. This paper revisits the depth estimation problem and avoids the explicit stereo matching step by using a simple two-tower convolutional neural network. The proposed algorithm is called 2T-UNet. The idea behind 2T-UNet is to replace cost volume construction with twin convolution towers whose weights are allowed to differ. The inputs to the twin encoders in 2T-UNet also differ from those of existing stereo methods. Generally, a stereo network takes a left and right image pair as input to determine the scene geometry. In the 2T-UNet model, however, the right stereo image is one input, and the left stereo image together with its monocular depth clue information is the other. The depth clues provide complementary suggestions that help enhance the quality of the predicted scene geometry. 2T-UNet surpasses state-of-the-art monocular and stereo depth estimation methods on the challenging Scene Flow dataset, both quantitatively and qualitatively. The architecture also performs well on complex natural scenes, highlighting its usefulness for various real-time applications. Pretrained weights and code will be made readily available.
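The input arrangement described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the `Tower` class, the pointwise (1x1) convolutions, the concatenation-based fusion, the prediction head, and all sizes are assumptions chosen only to show two encoders with separate weights, one receiving the right image and the other receiving the left image stacked with a monocular depth clue, with no cost volume in between.

```python
import numpy as np

def conv1x1(x, w):
    """Pointwise convolution: x is (C_in, H, W), w is (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", w, x)

def relu(x):
    return np.maximum(x, 0.0)

class Tower:
    """Toy encoder (hypothetical): two pointwise conv layers with its own weights."""
    def __init__(self, in_channels, hidden, rng):
        self.w1 = rng.standard_normal((hidden, in_channels)) * 0.1
        self.w2 = rng.standard_normal((hidden, hidden)) * 0.1

    def __call__(self, x):
        return relu(conv1x1(relu(conv1x1(x, self.w1)), self.w2))

def two_tower_depth(left, right, mono_depth, left_tower, right_tower, w_head):
    # Left tower input: left RGB plus the monocular depth clue (3 + 1 channels).
    left_in = np.concatenate([left, mono_depth[None]], axis=0)
    f_left = left_tower(left_in)
    # Right tower input: the right RGB image only.
    f_right = right_tower(right)
    # Assumed fusion: channel-wise concatenation instead of a cost volume.
    fused = np.concatenate([f_left, f_right], axis=0)
    # Assumed head: one pointwise conv that maps fused features to a depth map.
    return conv1x1(fused, w_head)[0]

rng = np.random.default_rng(0)
H, W, hidden = 8, 16, 4
left = rng.random((3, H, W))        # placeholder left image
right = rng.random((3, H, W))       # placeholder right image
mono_depth = rng.random((H, W))     # placeholder monocular depth clue

left_tower = Tower(in_channels=4, hidden=hidden, rng=rng)   # separate weights
right_tower = Tower(in_channels=3, hidden=hidden, rng=rng)  # separate weights
w_head = rng.standard_normal((1, 2 * hidden)) * 0.1

depth = two_tower_depth(left, right, mono_depth, left_tower, right_tower, w_head)
print(depth.shape)
```

Because the towers do not share weights, each can specialize in its own input: the left tower has four input channels and the right tower has three. A real model would use spatial convolutions, a U-Net style decoder, and trained weights in place of the random ones here.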