Current developments in temporal event or action localization usually target actions captured by a single camera. However, extensive events or actions in the wild may be captured as a sequence of shots by multiple cameras at different positions. In this paper, we propose a new and challenging task called multi-shot temporal event localization, and accordingly, collect a large-scale dataset called MUlti-Shot EventS (MUSES). MUSES has 31,477 event instances for a total of 716 video hours. The core nature of MUSES is the frequent shot cuts, with an average of 19 shots per instance and 176 shots per video, which induces large intra-instance variations. Our comprehensive evaluations show that the state-of-the-art method in temporal action localization achieves an mAP of only 13.1% at IoU=0.5. As a minor contribution, we present a simple baseline approach for handling the intra-instance variations, which reports an mAP of 18.9% on MUSES and 56.9% on THUMOS14 at IoU=0.5. To facilitate research in this direction, we release the dataset and the project code at https://songbai.site/muses/.