This paper investigates the modeling of automated machine description of sports video, an area that has seen much recent progress. Nevertheless, state-of-the-art approaches fall well short of capturing how human experts analyze sports scenes. There are several major reasons: (1) the datasets used are collected from non-official providers, which naturally creates a gap between models trained on those datasets and real-world applications; (2) previously proposed methods require extensive annotation effort (i.e., pixel-level player and ball segmentation) to localize useful visual features and yield acceptable results; (3) very few public datasets are available. To address these challenges, we propose a novel large-scale NBA dataset for Sports Video Analysis (NSVA) with a focus on captioning. We also design a unified approach that processes raw videos into a stack of meaningful features with minimal labelling effort, and show that cross modeling on such features using a transformer architecture leads to strong performance. In addition, we demonstrate the broad applicability of NSVA by addressing two additional tasks: fine-grained sports action recognition and salient player identification. Code and dataset are available at https://github.com/jackwu502/NSVA.