Few-shot learning (FSL) aims to train a model that can adapt to unseen visual classes from only a few labeled examples. Existing FSL methods rely heavily on visual data alone and thus fail to capture the semantic attributes needed to learn a more generalized version of a visual concept from very few examples. However, human visual learning is known to benefit immensely from inputs in multiple modalities such as vision, language, and audio. Inspired by how humans encapsulate existing knowledge of a visual category in the form of language, we introduce a contrastive alignment mechanism between visual and semantic feature vectors to learn more generalized visual concepts for few-shot learning. Our method simply adds an auxiliary contrastive learning objective, on top of the existing training mechanism, that captures the contextual knowledge of a visual category from a strong textual encoder. The approach is therefore generic and can be plugged into any existing FSL method. The pre-trained semantic feature extractor we use (learned from large-scale text corpora) provides strong contextual prior knowledge to assist FSL. Experimental results on popular FSL datasets show that our approach is generic in nature and provides a strong boost to existing FSL baselines.
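The auxiliary objective described above can be sketched as an InfoNCE-style contrastive loss that pulls each visual feature toward the text embedding of its class and pushes it away from the text embeddings of other classes. The snippet below is a minimal illustration in NumPy, not the paper's exact formulation; the function name, temperature value, and the assumption that matched visual–text pairs share a row index are ours.

```python
import numpy as np

def contrastive_alignment_loss(visual, text, temperature=0.1):
    """InfoNCE-style alignment between visual and semantic features.

    visual, text: arrays of shape (N, D) where row i of `text` is the
    text-encoder embedding for the class of visual feature i.
    (Hypothetical sketch; the paper's exact loss may differ.)
    """
    # L2-normalize both modalities so similarity is cosine similarity
    v = visual / np.linalg.norm(visual, axis=1, keepdims=True)
    t = text / np.linalg.norm(text, axis=1, keepdims=True)
    # (N, N) similarity matrix; matched pairs sit on the diagonal
    logits = (v @ t.T) / temperature
    # log-softmax over each row, with max subtracted for stability
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # negative log-likelihood of the matched (diagonal) pairs
    return -np.mean(np.diag(log_prob))
```

In a full FSL pipeline this term would be added, with a weighting coefficient, to the base few-shot classification loss, so any existing episodic trainer can adopt it without architectural changes.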