Towards Universal Video Retrieval: Generalizing Video Embedding via Synthesized Multimodal Pyramid Curriculum

The prevailing video retrieval paradigm is structurally misaligned: narrow benchmarks incentivize correspondingly limited data and single-task training. As a result, universal capability is suppressed, because no diagnostic evaluation exists that defines and demands multi-dimensional generalization. To break this cycle, we introduce a framework built on the co-design of evaluation, data, and modeling. First, we establish the Universal Video Retrieval Benchmark (UVRB), a suite of 16 datasets designed not only to measure performance but also to diagnose critical capability gaps across tasks and domains. Second, guided by UVRB's diagnostics, we introduce a scalable synthesis workflow that generates 1.55 million high-quality pairs to populate the semantic space required for universality. Finally, we devise the Modality Pyramid, a curriculum that trains our General Video Embedder (GVE) by explicitly leveraging the latent interconnections within our diverse data. Extensive experiments show that GVE achieves state-of-the-art zero-shot generalization on UVRB. Our analysis further reveals that popular benchmarks are poor predictors of general ability and that partially relevant retrieval is a dominant but overlooked scenario. Overall, our co-designed framework provides a practical path beyond this limited scope and toward truly universal video retrieval.
