We present three multi-scale similarity learning architectures, or DeepSim networks. These models learn pixel-level matching with a contrastive loss and are agnostic to the scene geometry. We establish a middle ground between hybrid and end-to-end approaches by learning to densely match all corresponding pixels of an epipolar pair at once. Our features are learned on large image tiles so that they are expressive and capture the scene's wider context. We also demonstrate that curated sample mining enhances the overall robustness of the predicted similarities and improves performance on radiometrically homogeneous areas. We run experiments on aerial and satellite datasets. Our DeepSim-Nets outperform the baseline hybrid approaches and generalize better to unseen scene geometries than end-to-end methods. Our flexible architecture can be readily adopted in standard multi-resolution image matching pipelines.
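To make the pixel-level contrastive objective concrete, the following is a minimal sketch of a dense hinge-style contrastive loss over per-pixel embeddings of a rectified epipolar pair. It assumes a shared feature extractor producing (B, C, H, W) feature maps and ground-truth left-to-right disparities; the function name, margin value, and single fixed-offset negative per pixel are illustrative assumptions, not the paper's exact loss or its curated sample mining strategy.

```python
import torch
import torch.nn.functional as F

def dense_contrastive_loss(feat_left, feat_right, disparity, valid_mask,
                           margin=0.3, neg_offset=4):
    """Hinge contrastive loss over dense per-pixel embeddings (illustrative).

    feat_left, feat_right: (B, C, H, W) feature maps from a shared encoder
    disparity:             (B, H, W) ground-truth disparity (left -> right)
    valid_mask:            (B, H, W) bool mask of pixels with known disparity
    neg_offset:            hypothetical fixed shift used to pick one negative
                           per pixel along the epipolar line
    """
    B, C, H, W = feat_left.shape
    fl = F.normalize(feat_left, dim=1)   # unit-norm embeddings -> cosine sims
    fr = F.normalize(feat_right, dim=1)

    # x-coordinate of each left pixel and of its true match on the epipolar line
    xs = torch.arange(W, device=fl.device).view(1, 1, W).expand(B, H, W)
    x_pos = (xs.float() - disparity).round().long().clamp(0, W - 1)
    x_neg = (x_pos + neg_offset).clamp(0, W - 1)  # shifted non-matching pixel

    # gather the matching / non-matching right-image feature for every left pixel
    idx_pos = x_pos.unsqueeze(1).expand(B, C, H, W)
    idx_neg = x_neg.unsqueeze(1).expand(B, C, H, W)
    fr_pos = torch.gather(fr, 3, idx_pos)
    fr_neg = torch.gather(fr, 3, idx_neg)

    sim_pos = (fl * fr_pos).sum(1)   # similarity of true correspondences
    sim_neg = (fl * fr_neg).sum(1)   # similarity of negatives

    # push negatives at least `margin` below positives, averaged over valid pixels
    loss = F.relu(margin - (sim_pos - sim_neg))
    return loss[valid_mask].mean()
```

Because the loss is computed densely over whole tiles rather than on pre-cropped patch pairs, one forward pass supervises every pixel of the epipolar pair at once, which is the "middle ground" the abstract refers to.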