With video exploding across social media, surveillance, and education, compressing long footage into concise yet faithful surrogates is crucial. Supervised methods learn frame/shot importance from dense labels and excel in-domain, but are costly to annotate and brittle across datasets; unsupervised methods avoid labels but often miss high-level semantics and narrative cues. Recent zero-shot pipelines use LLMs for training-free summarization, yet remain sensitive to handcrafted prompts and dataset-specific normalization. We propose a rubric-guided, pseudo-labeled prompting framework. A small subset of human annotations is converted into high-confidence pseudo labels and aggregated into structured, dataset-adaptive scoring rubrics for interpretable scene evaluation. At inference, boundary scenes (first/last) are scored from their own descriptions alone, while prompts for intermediate scenes additionally include brief summaries of adjacent segments to assess progression and redundancy, enabling the LLM to balance local salience with global coherence without any parameter tuning. Across three benchmarks, our method is consistently effective. On SumMe and TVSum it achieves F1 scores of 57.58 and 63.05, surpassing a zero-shot baseline (56.73, 62.21) by +0.85 and +0.84 and approaching supervised performance. On the query-focused QFVS benchmark it attains 53.79 F1, exceeding the baseline (53.42) by +0.37 and remaining stable across validation videos. These results show that rubric-guided pseudo labeling, coupled with contextual prompting, stabilizes LLM-based scoring and yields a general, interpretable zero-shot paradigm for both generic and query-focused video summarization.
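To make the contextual prompting scheme concrete, the sketch below shows one plausible way to assemble rubric-guided scoring prompts: boundary scenes are prompted with only their own description, while intermediate scenes also receive brief summaries of the adjacent segments. This is a minimal illustration under assumed conventions, not the authors' implementation; the function names (`build_scene_prompt`, `build_all_prompts`), the rubric text, and the prompt wording are all hypothetical.

```python
# Illustrative sketch (not the authors' code): assembling rubric-guided,
# context-aware prompts for scene-level LLM scoring. All names and prompt
# wording are hypothetical placeholders.

from typing import List, Optional


def build_scene_prompt(
    rubric: str,
    scene_desc: str,
    prev_summary: Optional[str] = None,
    next_summary: Optional[str] = None,
) -> str:
    """Compose a scoring prompt for one scene.

    Boundary scenes (first/last) are scored from their own description only;
    intermediate scenes additionally see brief summaries of the adjacent
    segments so the LLM can judge progression and redundancy.
    """
    parts = [
        "You are scoring a video scene for inclusion in a summary.",
        f"Scoring rubric:\n{rubric}",
        f"Scene description:\n{scene_desc}",
    ]
    if prev_summary is not None:
        parts.append(f"Summary of the preceding segment:\n{prev_summary}")
    if next_summary is not None:
        parts.append(f"Summary of the following segment:\n{next_summary}")
    parts.append("Return a single importance score between 0 and 1.")
    return "\n\n".join(parts)


def build_all_prompts(rubric: str, scene_descs: List[str]) -> List[str]:
    """Build one prompt per scene; here the raw neighboring descriptions
    stand in for the brief adjacent-segment summaries."""
    prompts = []
    last = len(scene_descs) - 1
    for i, desc in enumerate(scene_descs):
        if i == 0 or i == last:
            # Boundary scene: scored from its own description alone.
            prompts.append(build_scene_prompt(rubric, desc))
        else:
            # Intermediate scene: include adjacent-segment context.
            prompts.append(
                build_scene_prompt(rubric, desc, scene_descs[i - 1], scene_descs[i + 1])
            )
    return prompts


if __name__ == "__main__":
    rubric = (
        "- Reward visually distinctive, plot-advancing moments.\n"
        "- Penalize redundancy with neighboring scenes."
    )
    scenes = [
        "Opening shot of the city skyline.",
        "A chase through the crowded market.",
        "Closing interview with the driver.",
    ]
    for prompt in build_all_prompts(rubric, scenes):
        print(prompt, end="\n\n---\n\n")
```

The returned prompts would then be sent to an LLM scorer of choice; only the prompt-construction logic is sketched here, since the scoring model and decoding settings are outside the scope of this illustration.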