The design of more complex and powerful neural network models has significantly advanced the state-of-the-art in visual object tracking. These advances can be attributed to deeper networks or to the introduction of new building blocks, such as transformers. However, in the pursuit of increased tracking performance, runtime often suffers. Furthermore, efficient tracking architectures have received surprisingly little attention. In this paper, we introduce the Exemplar Transformer, a transformer module employing a single instance-level attention layer for realtime visual object tracking. E.T.Track, our visual tracker that incorporates Exemplar Transformer modules, runs at 47 FPS on a CPU. This is up to 8x faster than other transformer-based models. Compared to lightweight trackers that can operate in realtime on standard CPUs, E.T.Track consistently outperforms all other methods on the LaSOT, OTB-100, NFS, TrackingNet, and VOT-ST2020 datasets. The code will be made publicly available upon publication.
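The core idea named in the abstract, a single instance-level attention layer, can be illustrated with a minimal sketch. The sketch below is an assumption-laden illustration, not the authors' implementation: it shows one way a single global query can attend over a small bank of learned exemplar keys/values, so the attention cost scales with the exemplar count rather than with the squared number of spatial tokens. All parameter names and shapes (`Wq`, `exemplar_keys`, `exemplar_values`) are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def exemplar_attention(x, Wq, exemplar_keys, exemplar_values):
    """Illustrative single-query attention over a learned exemplar bank.

    x:               (H*W, C) flattened feature-map tokens
    Wq:              (C, C)   query projection (hypothetical shape)
    exemplar_keys:   (E, C)   learned exemplar keys,   E << H*W
    exemplar_values: (E, C)   learned exemplar values
    """
    # One instance-level query from the pooled feature map,
    # instead of H*W per-token queries as in standard self-attention.
    q = x.mean(axis=0) @ Wq
    scores = exemplar_keys @ q / np.sqrt(q.size)   # (E,) similarity to each exemplar
    attn = softmax(scores)
    ctx = attn @ exemplar_values                   # (C,) aggregated exemplar context
    # Broadcast the same context back onto every spatial token (residual add).
    return x + ctx
```

Under this reading, the attention itself costs O(E*C) per frame rather than O((H*W)^2 * C), which is consistent with the CPU-realtime goal stated in the abstract.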