Current methods for few-shot action recognition mainly fall into the metric learning framework following ProtoNet. However, they either ignore the effect of representative prototypes or fail to adequately enhance the prototypes with multimodal information. In this work, we propose a novel Multimodal Prototype-Enhanced Network (MORN), which uses the semantic information of label texts as multimodal information to enhance prototypes and consists of two modality flows. In the visual flow, a CLIP visual encoder is introduced and visual prototypes are computed by the Temporal-Relational CrossTransformer (TRX) module. In the text flow, a frozen CLIP text encoder is introduced and a semantic-enhanced module is used to enhance text features; text prototypes are then obtained by inflating. The final multimodal prototypes are computed by a multimodal prototype-enhanced module. Moreover, no evaluation metric currently exists for assessing the quality of prototypes. To the best of our knowledge, we are the first to propose such a metric, called Prototype Similarity Difference (PRIDE), which evaluates how well prototypes discriminate between different categories. We conduct extensive experiments on four popular datasets, and MORN achieves state-of-the-art results on HMDB51, UCF101, Kinetics and SSv2. MORN also performs well on PRIDE, and we explore the correlation between PRIDE and accuracy.
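The abstract describes PRIDE only at a high level: a score that measures how well prototypes discriminate between categories. As an illustration of the general idea, the following minimal sketch contrasts each query's cosine similarity to its own class prototype against its similarity to the most confusable other class; the function name `pride_score` and the exact formula are assumptions for illustration, not the paper's definition.

```python
import numpy as np

def pride_score(prototypes, queries, labels):
    """Illustrative prototype-discrimination score (NOT the paper's exact
    PRIDE formula). Assumes a higher own-class similarity relative to the
    hardest other class indicates more discriminative prototypes."""
    # L2-normalise rows so dot products are cosine similarities.
    P = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    Q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    sims = Q @ P.T                                  # (num_queries, num_classes)
    idx = np.arange(len(labels))
    own = sims[idx, labels]                         # similarity to own prototype
    others = sims.copy()
    others[idx, labels] = -np.inf                   # mask out the own class
    hardest = others.max(axis=1)                    # most confusable other class
    return float((own - hardest).mean())            # larger = better separation

# Toy 2-class example with well-separated prototypes.
protos = np.array([[1.0, 0.0], [0.0, 1.0]])
queries = np.array([[0.9, 0.1], [0.1, 0.9]])
score = pride_score(protos, queries, np.array([0, 1]))
```

For well-separated prototypes as above, the score is close to 1; prototypes that collapse onto each other drive it toward 0, matching the abstract's stated goal of quantifying category discrimination.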