From Saliency to DINO: Saliency-guided Vision Transformer for Few-shot Keypoint Detection

Unlike current deep keypoint detectors, which are trained to recognize a limited number of body parts, few-shot keypoint detection (FSKD) attempts to localize any keypoint, whether novel or base, depending on the reference samples. FSKD requires semantically meaningful relations for keypoint similarity learning in order to overcome ubiquitous noise and ambiguous local patterns. The vision transformer (ViT) is a natural remedy because it captures long-range relations well. However, because its attention matrix is global, ViT may model irrelevant features outside the region of interest, which degrades similarity learning between support and query features. In this paper, we present a novel saliency-guided vision transformer, dubbed SalViT, for few-shot keypoint detection. SalViT combines a uniquely designed masked self-attention with a morphology learner: the former introduces the saliency map as a soft mask that constrains self-attention to the foreground, while the latter leverages so-called power normalization to adjust the morphology of the saliency map, realizing a "dynamically changing receptive field". Moreover, as saliency detectors add computation, we show that the attentive masks of a DINO transformer can replace saliency. On top of SalViT, we also investigate i) transductive FSKD, which enhances keypoint representations with unlabelled data, and ii) FSKD under occlusions. We show that our model performs well on five public datasets and achieves about 10% higher PCK (Percentage of Correct Keypoints) than the normally trained model under severe occlusions.
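
To make the two ideas in the abstract concrete, here is a minimal NumPy sketch of self-attention with a saliency soft mask whose shape is adjusted by power normalization. It is an illustration under assumptions, not the paper's implementation: the abstract does not give the exact formulas, so the element-wise power `s ** gamma`, the way the mask enters the attention (added as a log-bias on the keys), the identity query/key/value projections, and the names `power_normalize`, `saliency_masked_attention`, `gamma`, and `eps` are all hypothetical choices.

```python
import numpy as np


def power_normalize(saliency, gamma):
    """Assumed form of power normalization: element-wise s ** gamma on [0, 1].

    gamma < 1 raises mid-range saliency values (the soft mask widens),
    gamma > 1 lowers them (the soft mask tightens around the most salient area).
    """
    return np.clip(saliency, 0.0, 1.0) ** gamma


def saliency_masked_attention(x, saliency, gamma=1.0, eps=1e-12):
    """Single-head self-attention with a saliency soft mask on the keys.

    x:        (N, D) token features (identity Q/K/V projections for brevity).
    saliency: (N,) per-token saliency in [0, 1].
    Returns the attended features (N, D) and the attention weights (N, N).
    """
    n, d = x.shape
    logits = x @ x.T / np.sqrt(d)
    mask = power_normalize(saliency, gamma)
    # Adding log(mask) to the logits is equivalent to multiplying the softmax
    # weights by the mask and renormalizing, so background keys are damped.
    logits = logits + np.log(mask + eps)[None, :]
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ x, weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tokens = rng.normal(size=(4, 8))
    sal = np.array([1.0, 0.5, 0.5, 0.0])  # last token is pure background
    _, w_wide = saliency_masked_attention(tokens, sal, gamma=0.5)
    _, w_tight = saliency_masked_attention(tokens, sal, gamma=2.0)
    print("attention on mid-saliency tokens, gamma=0.5:", w_wide[:, 1:3].sum(axis=1))
    print("attention on mid-saliency tokens, gamma=2.0:", w_tight[:, 1:3].sum(axis=1))
```

In this sketch a learnable `gamma` would play the role of the morphology learner: lowering it lets more attention flow to moderately salient tokens, raising it concentrates attention on the most salient ones, which is one way to read the "dynamically changing receptive field". Under the same assumptions, a DINO attention map normalized to [0, 1] could be passed in place of `saliency`.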
