A single image-level annotation only correctly describes an often small subset of an image's content, particularly when complex real-world scenes are depicted. While this might be acceptable in many classification scenarios, it poses a significant challenge for applications where the set of classes differs significantly between training and test time. In this paper, we take a closer look at the implications in the context of $\textit{few-shot learning}$. Splitting the input samples into patches and encoding these with the help of Vision Transformers allows us to establish semantic correspondences between local regions across images, independent of their respective class. The most informative patch embeddings for the task at hand are then determined as a function of the support set via online optimization at inference time, additionally providing visual interpretability of `$\textit{what matters most}$' in the image. We build on recent advances in unsupervised training of networks via masked image modelling to overcome the lack of fine-grained labels and to learn the more general statistical structure of the data, while avoiding the negative influence of image-level annotations, $\textit{aka}$ supervision collapse. Experimental results show the competitiveness of our approach, achieving new state-of-the-art results on four popular few-shot classification benchmarks for $5$-shot and $1$-shot scenarios.
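To make the described pipeline concrete, the following is a minimal sketch, not the authors' implementation: a pretrained ViT is assumed to produce per-patch embeddings, per-patch importance weights are fitted by online optimization on each episode's support set, and queries are classified by patch-to-patch similarity against the reweighted support tokens. All names (`fit_patch_weights`, `classify_queries`, the hypothetical `vit` encoder) and the concrete loss are illustrative assumptions, e.g. the sketch lets each support image contribute to its own class prototype rather than using a leave-one-out split.

```python
import torch
import torch.nn.functional as F

def class_prototypes(emb, labels, n_way):
    # Mean embedding per class: emb [S, D] -> [n_way, D].
    return torch.stack([emb[labels == c].mean(dim=0) for c in range(n_way)])

def fit_patch_weights(s_tokens, s_labels, n_way, steps=15, lr=0.1, tau=0.1):
    # Online optimization at inference time: learn one importance weight per
    # support patch so that the reweighted support embeddings separate the
    # episode's classes. s_tokens: [S, N, D] ViT patch embeddings (assumption).
    w = torch.zeros(s_tokens.shape[:2], requires_grad=True)   # [S, N]
    opt = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):
        a = w.softmax(dim=-1).unsqueeze(-1)                   # [S, N, 1]
        emb = F.normalize((a * s_tokens).sum(dim=1), dim=-1)  # [S, D]
        protos = F.normalize(class_prototypes(emb, s_labels, n_way), dim=-1)
        # Simplification: each support is contained in its own prototype; a
        # faithful variant would hold it out when computing the class mean.
        loss = F.cross_entropy(emb @ protos.T / tau, s_labels)
        opt.zero_grad(); loss.backward(); opt.step()
    return w.detach().softmax(dim=-1)

@torch.no_grad()
def classify_queries(q_tokens, s_tokens, s_labels, w, n_way):
    # Score each query by its best patch-to-patch match against the
    # reweighted support tokens, then average over the shots of each class.
    q = F.normalize(q_tokens, dim=-1)                         # [Q, Nq, D]
    s = F.normalize(s_tokens, dim=-1) * w.unsqueeze(-1)       # [S, Ns, D]
    sim = torch.einsum('qnd,smd->qsnm', q, s).amax(dim=(2, 3))  # [Q, S]
    scores = torch.stack([sim[:, s_labels == c].mean(dim=1)
                          for c in range(n_way)], dim=1)      # [Q, n_way]
    return scores.argmax(dim=1)

# Hypothetical usage for a 5-way episode:
#   s_tokens = vit(support_images)   # [5*shots, N, D]
#   q_tokens = vit(query_images)     # [Q, N, D]
#   w = fit_patch_weights(s_tokens, s_labels, n_way=5)
#   preds = classify_queries(q_tokens, s_tokens, s_labels, w, n_way=5)
```

The fitted weights `w` double as the interpretability signal mentioned in the abstract: reshaped to the patch grid, they form a heatmap over each support image indicating which regions drove the episode's decision.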