Depth estimation from a single image is a core problem in computer vision with many applications. Conventional methods face a trade-off between global consistency and fine-grained detail because their local receptive fields limit long-range dependency modeling, which reduces their practicality. This limitation is inherited from the convolutional neural network components of the architecture. In this paper, a dual window transformer-based network, DwinFormer, is proposed, which uses both local and global features for end-to-end monocular depth estimation. DwinFormer consists of dual window self-attention and cross-attention transformers, Dwin-SAT and Dwin-CAT, respectively. Dwin-SAT extracts fine-grained, locally aware features while also capturing global context. It uses local and global window attention to capture both short-range and long-range dependencies, avoiding the need for complex and computationally expensive operations such as attention masking or window shifting. Moreover, Dwin-SAT introduces inductive biases that provide desirable properties, such as translational equivariance and reduced dependence on large-scale data. Conventional decoding methods often rely on skip connections, which may result in semantic discrepancies and a lack of global context when fusing encoder and decoder features. In contrast, Dwin-CAT employs both local and global window cross-attention to fuse encoder and decoder features with fine-grained local and contextually aware global information, narrowing the semantic gap. Extensive experiments on the NYU-Depth-V2 and KITTI datasets show that the proposed method consistently outperforms existing approaches in both indoor and outdoor environments.
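To make the distinction between local and global window attention concrete, the following is a minimal NumPy sketch, not the authors' implementation. The abstract does not specify how global windows are formed; the strided-grid grouping in `global_windows` is an assumption chosen for illustration, and all function names are hypothetical. Learned query/key/value projections and multiple heads are omitted for brevity.

```python
import numpy as np

def local_windows(x, w):
    """Split an (H, W, C) map into non-overlapping w x w windows of neighbouring tokens.
    Returns an array of shape (num_windows, w*w, C)."""
    H, W, C = x.shape
    x = x.reshape(H // w, w, W // w, w, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, w * w, C)

def global_windows(x, w):
    """ASSUMPTION: form each w x w window from tokens sampled on a strided grid,
    so that every window spans the whole feature map."""
    H, W, C = x.shape
    x = x.reshape(w, H // w, w, W // w, C)
    return x.transpose(1, 3, 0, 2, 4).reshape(-1, w * w, C)

def attention(q, k, v):
    """Scaled dot-product attention applied independently inside each window."""
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Token indices on an 8 x 8 map show which positions share a window.
idx = np.arange(64, dtype=float).reshape(8, 8, 1)
print(local_windows(idx, 2)[0, :, 0])   # neighbouring tokens: 0, 1, 8, 9
print(global_windows(idx, 2)[0, :, 0])  # spread-out tokens: 0, 4, 32, 36

rng = np.random.default_rng(0)
enc = rng.normal(size=(8, 8, 4))  # stand-in encoder features
dec = rng.normal(size=(8, 8, 4))  # stand-in decoder features

# Self-attention: queries, keys, and values come from the same feature map.
sa_local = attention(*(local_windows(enc, 2),) * 3)
sa_global = attention(*(global_windows(enc, 2),) * 3)

# Cross-attention: decoder windows query the matching encoder windows.
ca_local = attention(local_windows(dec, 2), local_windows(enc, 2), local_windows(enc, 2))
ca_global = attention(global_windows(dec, 2), global_windows(enc, 2), global_windows(enc, 2))
```

In this sketch, neither partition needs attention masks or shifted windows: the local partition mixes adjacent tokens, while the assumed grid partition mixes tokens that are far apart, and both cost the same per window.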