Temporal video segmentation and classification have been greatly advanced by public benchmarks in recent years. However, such research still mainly focuses on human actions and fails to describe videos holistically. In addition, previous research tends to pay close attention to visual information while ignoring the multi-modal nature of videos. To fill this gap, we construct the Tencent `Ads Video Segmentation'~(TAVS) dataset in the ads domain to escalate multi-modal video analysis to a new level. TAVS describes videos from three independent perspectives, namely `presentation form', `place', and `style', and contains rich multi-modal information such as video, audio, and text. TAVS is organized hierarchically in semantic aspects for comprehensive temporal video segmentation, with three levels of categories for multi-label classification, e.g., `place' - `working place' - `office'. TAVS is therefore distinguished from previous temporal segmentation datasets by its multi-modal information, holistic view of categories, and hierarchical granularities. It includes 12,000 videos, 82 classes, 33,900 segments, 121,100 shots, and 168,500 labels. Accompanying TAVS, we also present a strong multi-modal video segmentation baseline coupled with multi-label class prediction. Extensive experiments are conducted to evaluate our proposed method as well as existing representative methods, revealing the key challenges of our TAVS dataset.