WebUAV-3M: A Benchmark for Unveiling the Power of Million-Scale Deep UAV Tracking

Unmanned aerial vehicle (UAV) tracking is of great significance for a wide range of applications, such as delivery and agriculture. Previous benchmarks in this area mainly focused on small-scale tracking problems while overlooking data volume, the range of data modalities, the diversity of target categories and scenarios, and the evaluation protocols involved, which has obscured the potential of deep UAV tracking. In this work, we propose WebUAV-3M, the largest public UAV tracking benchmark to date, to facilitate both the development and evaluation of deep UAV trackers. WebUAV-3M contains over 3.3 million frames across 4,500 videos and offers 223 highly diverse target categories. Each video is densely annotated with bounding boxes by an efficient and scalable semiautomatic target annotation (SATA) pipeline. Importantly, to exploit the complementary strengths of language and audio, we enrich WebUAV-3M with both natural language specifications and audio descriptions. We believe that these additions will greatly boost future research on language features and audio cues for multimodal UAV tracking. In addition, we construct a fine-grained UAV tracking-under-scenario constraint (UTUSC) evaluation protocol and seven challenging scenario subtest sets to enable the community to develop, adapt and evaluate various types of advanced trackers. We provide extensive evaluations and detailed analyses of 43 representative trackers and envision future research directions in the field of deep UAV tracking and beyond. The dataset, toolkits and baseline results are available at https://github.com/983632847/WebUAV-3M.
