Recently, few-shot video classification has received increasing interest. Current approaches mostly focus on effectively exploiting the temporal dimension in videos to improve learning under low-data regimes. However, most works have largely ignored that videos are often accompanied by rich textual descriptions that can also serve as an essential source of information for few-shot recognition. In this paper, we propose to leverage these human-provided textual descriptions as privileged information when training a few-shot video classification model. Specifically, we formulate a text-based task conditioner to adapt video features to the few-shot learning task. Furthermore, our model follows a transductive setting, using the support textual descriptions and query instances to update a set of class prototypes and thereby improve its task-adaptation ability. Our model achieves state-of-the-art performance on four challenging benchmarks commonly used to evaluate few-shot video action classification models.
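The following is a minimal, hypothetical sketch of the two ideas described above: (i) a text-based task conditioner that adapts video features to the task, and (ii) a transductive refinement of class prototypes using support text and unlabeled query instances. The FiLM-style conditioning, the soft-assignment update rule, and all module names and dimensions are illustrative assumptions, not the paper's exact architecture.

```python
# Hypothetical sketch: text-conditioned features + transductive prototype refinement.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TextTaskConditioner(nn.Module):
    """Maps a pooled support-text embedding to per-channel scale/shift
    parameters that modulate the video features (FiLM-style; assumed)."""

    def __init__(self, text_dim: int, video_dim: int):
        super().__init__()
        self.to_scale = nn.Linear(text_dim, video_dim)
        self.to_shift = nn.Linear(text_dim, video_dim)

    def forward(self, video_feats: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # video_feats: (N, video_dim); text_emb: (text_dim,) pooled over the task's support texts
        gamma = self.to_scale(text_emb)            # (video_dim,)
        beta = self.to_shift(text_emb)             # (video_dim,)
        return video_feats * (1.0 + gamma) + beta  # broadcast over the N instances


def transductive_prototypes(support_feats, support_labels, query_feats,
                            n_way, steps=3, alpha=0.5):
    """Initialize class prototypes from (conditioned) support features, then
    refine them with soft assignments of the unlabeled query features."""
    protos = torch.stack(
        [support_feats[support_labels == c].mean(dim=0) for c in range(n_way)]
    )  # (n_way, D)
    for _ in range(steps):
        # Soft-assign each query to the current prototypes via cosine similarity.
        sims = F.cosine_similarity(query_feats.unsqueeze(1), protos.unsqueeze(0), dim=-1)
        weights = sims.softmax(dim=1)              # (n_query, n_way)
        # Weighted mean of queries per class, blended with the support prototypes.
        query_protos = weights.t() @ query_feats / (weights.sum(dim=0, keepdim=True).t() + 1e-8)
        protos = alpha * protos + (1.0 - alpha) * query_protos
    return protos
```

In this sketch, classification of a query would reduce to nearest-prototype matching on the refined prototypes; the blending weight `alpha` and the number of refinement steps are assumed hyperparameters.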