Human learning benefits from multi-modal inputs that often appear as rich semantics (e.g., a description of an object's attributes while learning about it). This enables us to learn generalizable concepts from very limited visual examples. However, current few-shot learning (FSL) methods use numerical class labels to denote object classes, which do not provide rich semantic meaning about the learned concepts. In this work, we show that by using 'class-level' language descriptions, which can be acquired with minimal annotation cost, we can improve FSL performance. Given a support set and queries, our main idea is to create a bottleneck visual feature (hybrid prototype), which is then used to generate language descriptions of the classes as an auxiliary task during training. We develop a Transformer-based forward and backward encoding mechanism that relates visual and semantic tokens and can encode intricate relationships between the two modalities. Forcing the prototypes to retain semantic information about the class description acts as a regularizer on the visual features, improving their generalization to novel classes at inference. Furthermore, this strategy imposes a human prior on the learned representations, ensuring that the model faithfully relates visual and semantic concepts, thereby improving model interpretability. Our experiments on four datasets and ablation studies show the benefit of effectively modeling rich semantics for FSL.
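The prototype-based setup described above can be illustrated with a minimal sketch: class prototypes are computed as the mean of support-set embeddings and queries are assigned to the nearest prototype. All names here are illustrative, not the authors' code; the paper's auxiliary description-generation step is indicated only as a comment.

```python
def mean(vectors):
    # Element-wise mean of a list of equal-length embedding vectors.
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def sq_dist(a, b):
    # Squared Euclidean distance between two embeddings.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def classify(query, support):
    # support: {class_label: [embedding, ...]} — the few-shot support set.
    prototypes = {c: mean(vs) for c, vs in support.items()}
    # In the paper, each (hybrid) prototype would additionally be fed to a
    # Transformer decoder that generates the class-level language
    # description; that generation loss regularizes the visual features.
    return min(prototypes, key=lambda c: sq_dist(query, prototypes[c]))

support = {"cat": [[1.0, 0.0], [0.8, 0.2]], "dog": [[0.0, 1.0]]}
print(classify([0.9, 0.1], support))  # → cat
```

During training, the classification loss from this nearest-prototype step would be combined with the weighted auxiliary description-generation loss, which is what constrains the prototypes to remain semantically meaningful.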