In this work, we contribute a new million-scale Unmanned Aerial Vehicle (UAV) tracking benchmark, called WebUAV-3M. First, we collect 4,485 videos with more than 3M frames from the Internet. Then, we devise an efficient and scalable Semi-Automatic Target Annotation (SATA) pipeline to label every frame of WebUAV-3M. To the best of our knowledge, WebUAV-3M, with its dense bounding-box annotations, is by far the largest public UAV tracking benchmark. By establishing a million-scale annotated benchmark covering a wide range of target categories, we expect to pave the way for follow-up studies in UAV tracking. Moreover, considering the close connections among visual appearance, natural language, and audio, we enrich WebUAV-3M with natural language specifications and audio descriptions, encouraging the exploration of natural language features and audio cues for UAV tracking. Equipped with this benchmark, we delve into million-scale deep UAV tracking problems, aiming to provide the community with a dedicated large-scale benchmark for training deep UAV trackers and evaluating UAV tracking approaches. Extensive experiments on WebUAV-3M demonstrate that there is still considerable room for improvement in robust deep UAV tracking. The dataset, toolkits, and baseline results will be available at \url{https://github.com/983632847/WebUAV-3M}.