The design of more complex and powerful neural network models has significantly advanced the state of the art in visual object tracking. These advances can be attributed to deeper networks or to the introduction of new building blocks, such as transformers. However, the pursuit of higher tracking performance often comes at the cost of runtime speed, and efficient tracking architectures have received surprisingly little attention. In this paper, we introduce the Exemplar Transformer, a transformer module that uses a single instance-level attention layer for real-time visual object tracking. E.T.Track, our visual tracker that incorporates Exemplar Transformer modules, runs at 47 FPS on a CPU, which is up to 8x faster than other transformer-based models. Compared to lightweight trackers that can operate in real time on standard CPUs, E.T.Track consistently outperforms all other methods on the LaSOT, OTB-100, NFS, TrackingNet, and VOT-ST2020 datasets. Code and models are available at https://github.com/pblatter/ettrack.