Video Understanding with Large Language Models: A Survey

Yolo Y. Tang, Jing Bi, Siting Xu, Luchuan Song, Susan Liang, Teng Wang, Daoan Zhang, Jie An, Jingyang Lin, Rongyi Zhu, Ali Vosoughi, Chao Huang, Zeliang Zhang, Pinxin Liu, Mingqian Feng, Feng Zheng, Jianguo Zhang, Ping Luo, Jiebo Luo, Chenliang Xu

From arXiv; accepted to IEEE Transactions on Circuits and Systems for Video Technology (TCSVT).

With the rapid growth of online video platforms and the escalating volume of video content, the demand for proficient video understanding tools has intensified markedly. Given the remarkable capabilities of large language models (LLMs) in language and multimodal tasks, this survey provides a detailed overview of recent advances in video understanding that harness the power of LLMs (Vid-LLMs). The emergent capabilities of Vid-LLMs are surprisingly advanced, particularly their ability to perform open-ended, multi-granularity (general, temporal, and spatiotemporal) reasoning combined with commonsense knowledge, suggesting a promising path for future video understanding. We examine the unique characteristics and capabilities of Vid-LLMs, categorizing the approaches into three main types: Video Analyzer × LLM, Video Embedder × LLM, and (Analyzer + Embedder) × LLM. We further identify five sub-types based on the functions of LLMs in Vid-LLMs: LLM as Summarizer, LLM as Manager, LLM as Text Decoder, LLM as Regressor, and LLM as Hidden Layer. The survey also presents a comprehensive study of the tasks, datasets, benchmarks, and evaluation methodologies for Vid-LLMs, and it explores their applications across various domains, highlighting their scalability and versatility in real-world video understanding challenges. Finally, it summarizes the limitations of existing Vid-LLMs and outlines directions for future research. For more information, readers are encouraged to visit the repository at https://github.com/yunlong10/Awesome-LLMs-for-Video-Understanding.
