We summarize our TRECVID 2022 Ad-hoc Video Search (AVS) experiments. Our solution is built on two new techniques: Lightweight Attentional Feature Fusion (LAFF) for combining diverse visual/textual features, and Bidirectional Negation Learning (BNL) for handling queries that contain negation cues. LAFF performs feature fusion at both early and late stages and at both the text and video ends to exploit diverse (off-the-shelf) features. Compared to multi-head self-attention, LAFF is much more compact yet more effective. Its attentional weights can also be used to select a smaller subset of features, with retrieval performance largely preserved. BNL trains a negation-aware video retrieval model by minimizing a bidirectionally constrained loss per triplet, where each triplet consists of a training video, its original description, and a partially negated version of that description. For video feature extraction, we use pre-trained CLIP, BLIP, BEiT, ResNeXt-101, and irCSN. For text features, we adopt bag-of-words, word2vec, CLIP, and BLIP. Our training data consists of MSR-VTT, TGIF, and VATEX, which were also used in our previous participation. In addition, we automatically caption the V3C1 collection for pre-training. The 2022 edition of the TRECVID benchmark has again been fruitful for the RUCMM team: our best run, with an infAP of 0.262, ranks second among all teams.
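To make the fusion idea concrete, the core of LAFF can be sketched as follows: each feature is projected into a common embedding space, a shared attention vector scores every projected feature, and a softmax over those scores yields the attentional weights used for the weighted sum. This is a minimal illustrative sketch, not the paper's implementation; the projection matrices (`proj_mats`) and the attention vector (`w_att`) are hypothetical stand-ins for learned parameters.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def laff_fuse(features, proj_mats, w_att):
    """Fuse k heterogeneous feature vectors into one vector.

    features  : list of k raw feature vectors (possibly different dims)
    proj_mats : list of k projection matrices mapping each feature to a
                common dimension d (hypothetical learned parameters)
    w_att     : shared attention vector of dimension d (hypothetical)
    Returns the fused (d,) vector and the k attentional weights, which
    LAFF also uses to rank and prune less useful features.
    """
    h = np.stack([P @ f for f, P in zip(features, proj_mats)])  # (k, d)
    alpha = softmax(h @ w_att)                                  # (k,)
    return alpha @ h, alpha
```

Because the weights `alpha` sum to one, sorting them gives a direct, training-free criterion for dropping low-weight features, which is how the abstract's feature-selection claim can be read.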
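The BNL triplet objective can likewise be illustrated. The abstract states only that each triplet pairs a training video with its original description and a partially negated description; one plausible hinge-style reading is that the video must score higher against the original caption than against the negated one by a margin. The exact bidirectionally constrained loss is defined in the paper itself, so the form below is an assumption for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def bnl_triplet_loss(sim_orig, sim_negated, margin=0.2):
    """Hinge loss over one (video, caption, negated-caption) triplet.

    sim_orig    : similarity of the video to its original description
    sim_negated : similarity of the video to the partially negated one
    Illustrative form only: the video should match the original
    description better than the negated one by at least `margin`.
    The paper's actual bidirectional constraint may differ.
    """
    return max(0.0, margin - sim_orig + sim_negated)
```

In training, `sim_orig` and `sim_negated` would come from a text-video similarity such as `cosine` over the fused embeddings; a zero loss means the negation is already correctly separated.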