Existing benchmarks for evaluating long video understanding fall short on multiple aspects, lacking either scale or annotation quality. These limitations arise from the difficulty of collecting dense annotations for long videos (e.g., actions, dialogues), which are often obtained by manually labeling many frames per second. In this work, we introduce an automated Annotation and Video Stream Alignment Pipeline (abbreviated ASAP). We demonstrate the generality of ASAP by aligning unlabeled videos of four different sports (Cricket, Football, Basketball, and American Football) with their corresponding dense annotations (i.e., commentary) freely available on the web. Our human studies indicate that ASAP can align videos and annotations with high fidelity, precision, and speed. We then leverage ASAP's scalability to create LCric, a large-scale long video understanding benchmark with over 1000 hours of densely annotated long Cricket videos (average sample length of 50 minutes) collected at virtually zero annotation cost. We benchmark and analyze state-of-the-art video understanding models on LCric through a large set of compositional multiple-choice and regression queries. We also establish a human baseline that indicates significant room for new research to explore.