Recently, Transformers have delivered state-of-the-art performance in sparse matching, a task crucial to high-performance 3D vision applications. Yet these Transformers are inefficient because of the quadratic computational complexity of their attention mechanism. To solve this problem, we adopt an efficient linear attention that reduces the complexity to linear. We then propose a new attentional aggregation that achieves high accuracy by aggregating both global and local information from sparse keypoints. To further improve efficiency, we propose joint learning of feature matching and description. Our joint learning enables simpler and faster matching than the Sinkhorn algorithm, which is often used to match descriptors learned from Transformers. Our method achieves performance competitive with the much larger state-of-the-art models SuperGlue (12M parameters) and SGMNet (30M parameters), using only 0.84M learnable parameters, on three benchmarks: HPatches, ETH, and Aachen Day-Night.
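To make the complexity claim concrete, below is a minimal sketch (not the paper's implementation) contrasting standard softmax attention, whose cost grows quadratically with the number of keypoints N, against kernel-based linear attention, which reorders the matrix products so the cost grows linearly in N. The elu(x)+1 feature map is an assumption borrowed from the common linear-attention formulation, chosen only for illustration.

```python
import numpy as np

def softmax_attention(Q, K, V):
    # Q, K, V: (N, d). Builds an explicit N x N score matrix -> O(N^2 * d).
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def linear_attention(Q, K, V, eps=1e-6):
    # Replace the softmax with a positive feature map phi (here elu(x) + 1,
    # an assumed choice) and reassociate the products as
    # phi(Q) @ (phi(K).T @ V), which costs O(N * d^2) instead of O(N^2 * d).
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))  # elu(x) + 1
    Qp, Kp = phi(Q), phi(K)
    kv = Kp.T @ V                    # (d, d) summary, independent of N
    z = Qp @ Kp.sum(axis=0) + eps    # (N,) normalizer
    return (Qp @ kv) / z[:, None]

# Usage: both variants map (N, d) keypoint features to (N, d) outputs,
# but only the linear variant avoids materializing the N x N attention map.
N, d = 1024, 64
Q, K, V = (np.random.randn(N, d) for _ in range(3))
out = linear_attention(Q, K, V)
```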