Detecting customized moments and highlights from videos given natural language (NL) user queries is an important but under-studied topic. One of the challenges in pursuing this direction is the lack of annotated data. To address this issue, we present the Query-based Video Highlights (QVHIGHLIGHTS) dataset. It consists of over 10,000 YouTube videos, covering a wide range of topics, from everyday activities and travel in lifestyle vlog videos to social and political activities in news videos. Each video in the dataset is annotated with: (1) a human-written free-form NL query, (2) relevant moments in the video w.r.t. the query, and (3) five-point scale saliency scores for all query-relevant clips. This comprehensive annotation enables us to develop and evaluate systems that detect relevant moments as well as salient highlights for diverse, flexible user queries. We also present a strong baseline for this task, Moment-DETR, a transformer encoder-decoder model that views moment retrieval as a direct set prediction problem, taking extracted video and query representations as inputs and predicting moment coordinates and saliency scores end-to-end. While our model does not utilize any human prior, we show that it performs competitively when compared to well-engineered architectures. With weakly supervised pretraining using ASR captions, Moment-DETR substantially outperforms previous methods. Lastly, we present several ablations and visualizations of Moment-DETR. Data and code are publicly available at https://github.com/jayleicn/moment_detr.
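To make the set-prediction formulation concrete, the sketch below shows a minimal, simplified PyTorch model in the spirit of Moment-DETR: a transformer encoder-decoder takes pre-extracted clip and query features, a fixed set of learned "moment queries" each predicts one candidate moment span plus a foreground/background label, and the encoder outputs yield per-clip saliency scores. This is an illustrative sketch under assumed shapes and head designs, not the authors' released implementation (see the repository linked above); all names such as `MomentDETRSketch`, `span_head`, and `saliency_head` are hypothetical.

```python
# Minimal sketch of a DETR-style moment retrieval model.
# Assumes pre-extracted video clip features and query token features.
import torch
import torch.nn as nn


class MomentDETRSketch(nn.Module):
    """Treats moment retrieval as direct set prediction: each learned
    decoder query predicts one candidate moment (center, width) and a
    foreground/background label; encoder outputs over video positions
    give per-clip saliency scores."""

    def __init__(self, d_model=256, num_moment_queries=10, num_classes=2):
        super().__init__()
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=8,
            num_encoder_layers=2, num_decoder_layers=2,
            batch_first=True,
        )
        # Learned decoder inputs, analogous to DETR's object queries.
        self.moment_queries = nn.Parameter(
            torch.randn(num_moment_queries, d_model))
        # Prediction heads (hypothetical design for this sketch).
        self.span_head = nn.Linear(d_model, 2)       # normalized (center, width)
        self.class_head = nn.Linear(d_model, num_classes)
        self.saliency_head = nn.Linear(d_model, 1)   # per-clip saliency

    def forward(self, video_feats, query_feats):
        # video_feats: (B, Nv, d) clip features; query_feats: (B, Nq, d).
        src = torch.cat([video_feats, query_feats], dim=1)  # joint input
        memory = self.transformer.encoder(src)
        tgt = self.moment_queries.unsqueeze(0).expand(src.size(0), -1, -1)
        hs = self.transformer.decoder(tgt, memory)
        num_video = video_feats.size(1)
        return {
            "spans": self.span_head(hs).sigmoid(),    # (B, #queries, 2)
            "logits": self.class_head(hs),            # (B, #queries, C)
            "saliency": self.saliency_head(
                memory[:, :num_video]).squeeze(-1),   # (B, Nv)
        }
```

As in DETR, training such a model would use a Hungarian matching between predicted and ground-truth moments before computing span regression and classification losses, so no hand-designed proposal or anchor machinery is needed.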