Existing benchmarks for evaluating long video understanding fall short on multiple fronts, lacking either scale or annotation quality. These limitations arise from the difficulty of collecting dense annotations for long videos (e.g., actions, dialogues), which are typically obtained by manually labeling many frames per second. In this work, we introduce an automated Annotation and video Stream Alignment Pipeline (abbreviated ASAP). We demonstrate the generality of ASAP by aligning unlabeled videos of four different sports (Cricket, Football, Basketball, and American Football) with their corresponding dense annotations (i.e., commentary) freely available on the web. Our human studies indicate that ASAP can align videos with their annotations with high fidelity, precision, and speed. We then leverage ASAP's scalability to create LCric, a large-scale long video understanding benchmark, with over 1000 hours of densely annotated long Cricket videos (with an average sample length of 50 mins) collected at virtually zero annotation cost. We benchmark and analyze state-of-the-art video understanding models on LCric through a large set of compositional multi-choice and regression queries. We also establish a human baseline that indicates significant room for new research to explore. The dataset along with the code for ASAP and baselines can be accessed here: https://asap-benchmark.github.io/.