Prompt tuning with large-scale pretrained vision-language models enables open-vocabulary prediction after training on limited base categories, e.g., in object classification and detection. In this paper, we propose compositional prompt tuning with motion cues: an extended prompt tuning paradigm for compositional predictions of video data. In particular, we present Relation Prompt (RePro) for Open-vocabulary Video Visual Relation Detection (Open-VidVRD), where conventional prompt tuning is easily biased to certain subject-object combinations and motion patterns. RePro addresses the two technical challenges of Open-VidVRD: 1) the prompt tokens should respect the two different semantic roles of subject and object, and 2) the tuning should account for the diverse spatio-temporal motion patterns of the subject-object compositions. Without bells and whistles, RePro achieves new state-of-the-art performance on two VidVRD benchmarks, not only on the base training object and predicate categories but also on the unseen ones. Extensive ablations also demonstrate the effectiveness of the proposed compositional and multi-mode design of prompts. Code is available at https://github.com/Dawn-LX/OpenVoc-VidVRD.
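The two design requirements stated in the abstract (role-specific prompt tokens and motion-dependent prompt selection) can be illustrated with a toy sketch. This is not the released RePro implementation: the dimensions, the three motion modes, the centre-distance motion cue, the mean-pooling stand-in for a text encoder, and every function name below are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, CTX_LEN, NUM_MODES = 8, 4, 3  # toy sizes, all hypothetical

# One group of context tokens per motion mode, with separate tokens for the
# subject role and the object role. In real prompt tuning these would be
# learnable parameters; here they are just randomly initialised arrays.
subj_ctx = rng.normal(size=(NUM_MODES, CTX_LEN, DIM))
obj_ctx = rng.normal(size=(NUM_MODES, CTX_LEN, DIM))


def center_distance(box_a, box_b):
    """Euclidean distance between the centres of two (x1, y1, x2, y2) boxes."""
    ca = np.array([(box_a[0] + box_a[2]) / 2, (box_a[1] + box_a[3]) / 2])
    cb = np.array([(box_b[0] + box_b[2]) / 2, (box_b[1] + box_b[3]) / 2])
    return float(np.linalg.norm(ca - cb))


def motion_mode(subj_traj, obj_traj, tol=1.0):
    """Assumed motion cue: 0 = approaching, 1 = moving apart, 2 = roughly static."""
    start = center_distance(subj_traj[0], obj_traj[0])
    end = center_distance(subj_traj[-1], obj_traj[-1])
    if end < start - tol:
        return 0
    if end > start + tol:
        return 1
    return 2


def compose_prompt(mode, predicate_emb):
    """Build role-specific prompts: [context tokens ; predicate embedding]."""
    subj_prompt = np.vstack([subj_ctx[mode], predicate_emb[None, :]])
    obj_prompt = np.vstack([obj_ctx[mode], predicate_emb[None, :]])
    return subj_prompt, obj_prompt


def encode(tokens):
    """Stand-in for a text encoder: mean-pool the tokens and L2-normalise."""
    pooled = tokens.mean(axis=0)
    return pooled / np.linalg.norm(pooled)


def relation_score(subj_feat, obj_feat, mode, predicate_emb):
    """Match subject and object visual features against their own role prompts."""
    subj_prompt, obj_prompt = compose_prompt(mode, predicate_emb)
    return float(subj_feat @ encode(subj_prompt) + obj_feat @ encode(obj_prompt))
```

In this sketch, a subject-object tracklet pair is first assigned a motion mode from its box trajectories, and the predicate is then scored with the prompt group for that mode, so that subject and object features are each compared against tokens dedicated to their own role.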