Estimating depth from a single 2D image is challenging because the stereo or multi-view cues typically required for depth perception are unavailable. This paper introduces a novel deep learning-based approach using an enhanced encoder-decoder architecture in which the Inception-ResNet-v2 model serves as the encoder. This is the first instance of utilizing Inception-ResNet-v2 as an encoder for monocular depth estimation, and it demonstrates improved performance over previous models. Our model effectively captures complex objects and fine-grained details, which are generally difficult to predict. Additionally, it incorporates multi-scale feature extraction to enhance depth prediction accuracy across objects of varying sizes and distances. We propose a composite loss function comprising a depth loss, a gradient edge loss, and a Structural Similarity Index Measure (SSIM) loss, whose weights are tuned so that the weighted sum balances the different aspects of depth estimation. Experimental results on the NYU Depth V2 dataset show that our model achieves state-of-the-art performance, with an Absolute Relative Error (ARE) of 0.064, Root Mean Square Error (RMSE) of 0.228, and accuracy ($\delta$ < 1.25) of 89.3%. These metrics demonstrate that our model can accurately predict depth even in challenging scenarios, providing a scalable solution for real-world applications in robotics, 3D reconstruction, and augmented reality.
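To make the composite objective concrete, the sketch below combines a point-wise depth term, a gradient edge term, and an SSIM term in TensorFlow/Keras. The weight constants, the `max_depth` normalization, and the choice of L1 for the point-wise term are illustrative assumptions, not the tuned values or exact formulation from the paper.

```python
import tensorflow as tf

# Hypothetical loss weights; the paper tunes these, and the exact values are not reproduced here.
W_DEPTH, W_EDGE, W_SSIM = 0.1, 1.0, 1.0

def composite_depth_loss(y_true, y_pred, max_depth=10.0):
    """Sketch of a composite loss with depth, gradient edge, and SSIM terms.

    Expects 4-D tensors of shape (batch, height, width, 1) holding depth maps.
    """
    # Point-wise depth term: mean absolute error between predicted and ground-truth depth.
    l_depth = tf.reduce_mean(tf.abs(y_pred - y_true))

    # Gradient edge term: L1 difference of vertical/horizontal depth gradients,
    # encouraging sharp depth discontinuities at object boundaries.
    dy_true, dx_true = tf.image.image_gradients(y_true)
    dy_pred, dx_pred = tf.image.image_gradients(y_pred)
    l_edge = tf.reduce_mean(tf.abs(dy_pred - dy_true) + tf.abs(dx_pred - dx_true))

    # Structural term: (1 - SSIM) / 2, so the term lies in [0, 1].
    l_ssim = tf.reduce_mean(
        (1.0 - tf.image.ssim(y_true, y_pred, max_val=max_depth)) * 0.5
    )

    # Weighted sum balancing the three aspects of depth estimation.
    return W_DEPTH * l_depth + W_EDGE * l_edge + W_SSIM * l_ssim
```

Such a loss can be passed directly to `model.compile(loss=composite_depth_loss, ...)` for an encoder-decoder built on a pretrained Inception-ResNet-v2 backbone (e.g., `tf.keras.applications.InceptionResNetV2`); the weight constants above would then be varied during fine-tuning.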