While deep learning is widely used for video analytics tasks such as video classification and action detection, dense action detection with fast-moving subjects in sports videos remains challenging. In this work, we release yet another sports video dataset, $\textbf{P$^2$A}$, for $\underline{P}$ing $\underline{P}$ong-$\underline{A}$ction detection. It consists of 2,721 video clips collected from broadcast videos of professional table tennis matches at the World Table Tennis Championships and Olympiads. We work with a crew of table tennis professionals and referees to obtain fine-grained action labels (in 14 classes) for every ping-pong action that appears in the dataset, and we formulate two action detection problems: action localization and action recognition. We evaluate a number of common action recognition models (e.g., TSM, TSN, Video Swin Transformer, and SlowFast) and action localization models (e.g., BSN, BSN++, BMN, and TCANet) on $\textbf{P$^2$A}$ for both problems under various settings. These models achieve only 48% area under the AR-AN (average recall vs. average number of proposals) curve for localization and 82% top-one accuracy for recognition, since ping-pong actions are dense with fast-moving subjects while broadcast videos run at only 25 FPS. The results confirm that $\textbf{P$^2$A}$ remains challenging and can serve as a benchmark for action detection from videos.