Short-form videos have exploded in popularity and now dominate social media trends. Prevailing short-video platforms,~\textit{e.g.}, Kuaishou (Kwai), TikTok, Instagram Reels, and YouTube Shorts, have changed the way we consume and create content. For video content creation and understanding, shot boundary detection (SBD) is one of the most essential components in a wide range of scenarios. In this work, we release a new public Short video sHot bOundary deTection dataset, named SHOT, consisting of 853 complete short videos and 11,606 shot annotations, including 2,716 high-quality shot boundary annotations in 200 test videos. Leveraging this new data wealth, we optimize the model design for video SBD by conducting neural architecture search in a search space encapsulating various advanced 3D ConvNets and Transformers. Our proposed approach, named AutoShot, achieves higher F1 scores than previous state-of-the-art approaches,~\textit{e.g.}, outperforming TransNetV2 by 4.2\%, when derived and evaluated on our newly constructed SHOT dataset. Moreover, to validate the generalizability of the AutoShot architecture, we directly evaluate it on three additional public datasets: ClipShots, BBC, and RAI, where the F1 scores of AutoShot surpass previous state-of-the-art approaches by 1.1\%, 0.9\%, and 1.2\%, respectively. The SHOT dataset and code are available at \url{https://github.com/wentaozhu/AutoShot.git}.
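The comparisons above are reported as F1 scores over detected shot boundaries. As a point of reference only, here is a minimal sketch of how boundary-level F1 is commonly computed for SBD: a predicted boundary counts as a true positive when it lies within a small frame tolerance of an unmatched ground-truth boundary. The function name, greedy matching scheme, and tolerance value are our assumptions for illustration, not the evaluation protocol of this paper.

```python
def sbd_f1(pred, gt, tolerance=2):
    """F1 between predicted and ground-truth boundary frame indices.

    A prediction matches at most one ground-truth boundary, and counts
    as a true positive if it is within `tolerance` frames of it.
    (Illustrative sketch; tolerance and matching rule are assumptions.)
    """
    gt_sorted = sorted(gt)
    matched = set()  # indices of ground-truth boundaries already claimed
    tp = 0
    for p in sorted(pred):
        for i, g in enumerate(gt_sorted):
            if i not in matched and abs(p - g) <= tolerance:
                matched.add(i)
                tp += 1
                break
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gt) if gt else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For example, with ground truth boundaries at frames 10, 51, and 200, predictions at 10, 50, and 99 yield two true positives (50 matches 51 within the tolerance), so precision and recall are both 2/3 and F1 is 2/3.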