The goal of building a benchmark (a suite of datasets) is to provide a unified protocol for fair evaluation and thus facilitate progress in a specific area. Nonetheless, we point out that existing action recognition protocols can yield partial evaluations due to several limitations. To comprehensively probe the effectiveness of spatiotemporal representation learning, we introduce BEAR, a new BEnchmark on video Action Recognition. BEAR is a collection of 18 video datasets grouped into 5 categories (anomaly, gesture, daily, sports, and instructional), covering a diverse set of real-world applications. With BEAR, we thoroughly evaluate 6 common spatiotemporal models pre-trained with both supervised and self-supervised learning. We also report transfer performance via standard finetuning, few-shot finetuning, and unsupervised domain adaptation. Our observations suggest that current state-of-the-art models cannot reliably deliver high performance on datasets close to real-world applications, and we hope BEAR can serve as a fair and challenging evaluation benchmark that yields insights for building next-generation spatiotemporal learners. Our dataset, code, and models are released at: https://github.com/AndongDeng/BEAR
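The abstract names three transfer protocols; as a rough illustration of the few-shot one, below is a minimal PyTorch sketch that builds a k-shot-per-class split and finetunes on it. Everything here is a placeholder assumption rather than the authors' actual protocol: the "videos" are random tensors, the backbone is a tiny 3D-conv classifier, and `k_shot_indices` is a hypothetical helper; consult the BEAR repo for the real evaluation code.

```python
# A minimal sketch of few-shot finetuning, one of the three transfer settings
# described in the abstract. Illustrative only: random tensors stand in for
# video clips, and the backbone is a toy 3D-conv model, not a BEAR model.
import random
from collections import defaultdict

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Subset, TensorDataset

def k_shot_indices(labels, k, seed=0):
    """Pick k example indices per class (hypothetical helper)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[int(y)].append(idx)
    picked = []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        picked.extend(idxs[:k])
    return picked

# Placeholder "videos": 200 clips of shape (C=3, T=8, H=32, W=32), 5 classes.
x = torch.randn(200, 3, 8, 32, 32)
y = torch.randint(0, 5, (200,))
train_set = TensorDataset(x, y)

# Few-shot split: k clips per class instead of the full training set.
few_shot = Subset(train_set, k_shot_indices(y.tolist(), k=4))
loader = DataLoader(few_shot, batch_size=8, shuffle=True)

# Stand-in spatiotemporal backbone: a tiny 3D-conv classifier.
model = nn.Sequential(
    nn.Conv3d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool3d(1),
    nn.Flatten(),
    nn.Linear(16, 5),
)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

# Standard finetuning loop over the few-shot subset.
model.train()
for epoch in range(5):
    for clips, targets in loader:
        optimizer.zero_grad()
        loss = criterion(model(clips), targets)
        loss.backward()
        optimizer.step()
```

The same loop doubles as standard finetuning if the full `train_set` replaces the `Subset`; unsupervised domain adaptation would additionally draw unlabeled clips from a target-domain dataset.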