Contrastive learning-based video-language representation learning approaches, e.g., CLIP, have achieved outstanding performance by pursuing semantic interaction upon pre-defined video-text pairs. To go beyond this coarse-grained global interaction, we must tackle the challenging problem of fine-grained cross-modal interaction. In this paper, we creatively model video and text as game players in a multivariate cooperative game, in order to handle the uncertainty of fine-grained semantic interactions, which exhibit diverse granularity, flexible combinations, and vague intensity. Concretely, we propose Hierarchical Banzhaf Interaction (HBI) to value the possible correspondences between video frames and text words, enabling sensitive and explainable cross-modal contrast. To efficiently realize the cooperative game over multiple video frames and multiple text words, the proposed method clusters the original video frames (text words) and computes the Banzhaf Interaction between the merged tokens. By stacking token merge modules, we achieve cooperative games at different semantic levels. Extensive experiments on commonly used text-video retrieval and video question answering benchmarks demonstrate the superior performance of our HBI and justify its efficacy. More encouragingly, HBI can also serve as a visualization tool that promotes the understanding of cross-modal interaction, which can have a far-reaching impact on the community. The project page is available at https://jpthu17.github.io/HBI/.
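To make the game-theoretic notion concrete, below is a minimal sketch of the Banzhaf Interaction index between two players: the expectation, over all coalitions excluding both players, of the marginal gain from adding them together rather than separately. The toy value function and token names (`frame_1`, `word_cat`, etc.) are hypothetical illustrations, not the paper's learned similarity function.

```python
from itertools import chain, combinations

def banzhaf_interaction(players, i, j, value):
    """Banzhaf Interaction index I(i, j): the average over all
    coalitions S ⊆ players \ {i, j} of
    v(S∪{i,j}) - v(S∪{i}) - v(S∪{j}) + v(S)."""
    rest = [p for p in players if p not in (i, j)]
    subsets = chain.from_iterable(
        combinations(rest, r) for r in range(len(rest) + 1))
    total, count = 0.0, 0
    for s in subsets:
        s = set(s)
        total += (value(s | {i, j}) - value(s | {i})
                  - value(s | {j}) + value(s))
        count += 1
    return total / count

# Hypothetical value function: a coalition is "worth" 1 only when it
# contains both the frame and its matching word (a stand-in for the
# cross-modal similarity the paper actually learns).
def toy_value(coalition):
    return 1.0 if {"frame_1", "word_cat"} <= coalition else 0.0

players = ["frame_1", "frame_2", "word_cat", "word_runs"]
print(banzhaf_interaction(players, "frame_1", "word_cat", toy_value))
# → 1.0  (the frame and word only create value together, so their
#         interaction index is maximal under this toy game)
```

Under this toy game, `frame_1` and `word_cat` receive a high interaction score because neither contributes value without the other, which is exactly the kind of frame-word correspondence HBI is designed to surface.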