The goal of spatio-temporal action detection is to determine when and where each person's action occurs in a video and to classify the corresponding action category. Most existing methods adopt fully supervised learning, which requires a large amount of labeled training data and makes zero-shot learning very difficult. In this paper, we propose to utilize a pre-trained vision-language model to extract representative image and text features, and to model the relationship between these features through different interaction modules to obtain an interaction feature. In addition, we use this feature to prompt each label to obtain more appropriate text features. Finally, we compute the similarity between the interaction feature and the text feature of each label to determine the action category. Our experiments on the J-HMDB and UCF101-24 datasets demonstrate that the proposed interaction modules and prompting align the vision-language features better, achieving excellent accuracy for zero-shot spatio-temporal action detection. The code will be released upon acceptance.
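To make the final classification step concrete, the sketch below illustrates how an action category could be assigned by comparing a person-level interaction feature against prompted per-label text features via cosine similarity, in the spirit of CLIP-style zero-shot matching. This is a minimal illustration, not the paper's implementation: the function name `classify_actions`, the tensor shapes, and the temperature value are all assumptions for the example.

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes (illustrative only):
#   interaction_feats: (P, D) one interaction feature per detected person
#   label_text_feats:  (C, D) one prompted text feature per candidate action label
def classify_actions(interaction_feats: torch.Tensor,
                     label_text_feats: torch.Tensor,
                     temperature: float = 0.01):
    # L2-normalize both modalities so the dot product equals cosine similarity,
    # as is standard for CLIP-style vision-language features.
    v = F.normalize(interaction_feats, dim=-1)
    t = F.normalize(label_text_feats, dim=-1)

    # Similarity between every person and every label, scaled by a temperature.
    logits = v @ t.T / temperature          # (P, C)
    probs = logits.softmax(dim=-1)

    # The predicted action category is the most similar label.
    return probs.argmax(dim=-1), probs

# Usage example with random features: 3 detected persons,
# 21 action classes (as in J-HMDB), 512-dimensional embeddings.
preds, probs = classify_actions(torch.randn(3, 512), torch.randn(21, 512))
```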