Spatio-temporal action detection is an important and challenging problem in video understanding. Existing action detection benchmarks are limited either by the small number of instances in each trimmed video or by their focus on relatively low-level atomic actions. This paper presents a new multi-person dataset of spatio-temporally localized sports actions, coined MultiSports. We first analyze the important ingredients of constructing a realistic and challenging dataset for spatio-temporal action detection by proposing three criteria: (1) motion-dependent identification, (2) well-defined boundaries, and (3) relatively high-level classes. Based on these guidelines, we build MultiSports v1.0 by selecting 4 sports classes, collecting around 3200 video clips, and annotating around 37790 action instances with 907k bounding boxes. Our dataset is characterized by strong diversity, detailed annotation, and high quality. With its realistic setting and dense annotations, MultiSports exposes the intrinsic challenge of action localization. To benchmark this, we adapt several representative methods to our dataset and give an in-depth analysis of the difficulty of action localization on it. We hope MultiSports can serve as a standard benchmark for spatio-temporal action detection in the future. Our dataset website is at https://deeperaction.github.io/multisports/.