Video search has become the primary way for users to discover videos relevant to a text query on large short-video sharing platforms. While training a query-video bi-encoder model on online search logs, we identify a modality bias phenomenon: the video encoder relies almost entirely on text matching, neglecting other modalities of the videos such as vision and audio. This modality imbalance results from a) the modality gap: the relevance between a query and a video's text is much easier to learn, since the query is itself a piece of text with the same modality as the video text; and b) data bias: most training samples can be solved by text matching alone. Here we share our practices for improving the first retrieval stage, including our solution to the modality imbalance issue. We propose MBVR (short for Modality Balanced Video Retrieval) with two key components: manually generated modality-shuffled (MS) samples and a dynamic margin (DM) based on visual relevance. Together they encourage the video encoder to pay balanced attention to each modality. Through extensive experiments on a real-world dataset, we show empirically that our method is both effective and efficient at mitigating the modality bias problem. We have also deployed MBVR on a large video platform and observed statistically significant gains over a highly optimized baseline in an A/B test and manual GSB evaluations.
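To make the two components concrete, below is a minimal PyTorch sketch, not the authors' released code, of how modality-shuffled negatives and a dynamic-margin hinge loss could be wired together. The names `fuse`, `ms_negatives`, `dynamic_margin_loss`, the projection layer, and the visual-relevance scores `vis_rel` are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def fuse(text_emb, vision_emb, proj):
    """Toy multimodal fusion: concatenate per-modality embeddings and
    project into the query embedding space."""
    return proj(torch.cat([text_emb, vision_emb], dim=-1))

def ms_negatives(text_emb, vision_emb, proj):
    """Modality-shuffled (MS) samples: keep each video's text but swap in
    another video's vision features, yielding hard negatives that can only
    match the query through text."""
    perm = torch.randperm(vision_emb.size(0))
    return fuse(text_emb, vision_emb[perm], proj)

def dynamic_margin_loss(q, pos, neg, vis_rel, base=0.2, scale=0.3):
    """Hinge loss whose margin grows with the positive pair's
    visual-relevance score vis_rel in [0, 1] (the DM idea)."""
    margin = base + scale * vis_rel
    return F.relu(F.cosine_similarity(q, neg)
                  - F.cosine_similarity(q, pos) + margin).mean()

# Usage with random tensors standing in for encoder outputs.
B, D = 32, 128
proj = torch.nn.Linear(2 * D, D)
q = torch.randn(B, D)                   # query encoder output
text, vision = torch.randn(B, D), torch.randn(B, D)
pos = fuse(text, vision, proj)          # matched query-video pairs
neg = ms_negatives(text, vision, proj)  # MS hard negatives
vis_rel = torch.rand(B)                 # visual relevance in [0, 1]
loss = dynamic_margin_loss(q, pos, neg, vis_rel)
loss.backward()
```

Under these assumptions, the MS negatives penalize a video encoder that scores videos by text alone, while the dynamic margin pushes harder on queries where visual content is known to matter.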