Learning Spatial-Frequency Transformer for Visual Object Tracking

Recent trackers adopt the Transformer to complement or replace the widely used ResNet as their backbone network. Although these trackers work well in regular scenarios, they simply flatten the 2D features into a sequence to match the Transformer's input format. We believe this operation ignores the spatial prior of the target object, which may lead to sub-optimal results. In addition, many works demonstrate that self-attention is in effect a low-pass filter, regardless of the input features or the keys/queries. That is to say, it may suppress the high-frequency components of the input features and preserve or even amplify the low-frequency information. To handle these issues, we propose a unified Spatial-Frequency Transformer that models the Gaussian spatial Prior and High-frequency emphasis Attention (GPHA) simultaneously. Specifically, the Gaussian spatial prior is generated using dual Multi-Layer Perceptrons (MLPs) and injected into the similarity matrix produced by multiplying the Query and Key features in self-attention. The result is fed into a Softmax layer and then decomposed into two components, i.e., the direct signal and the high-frequency signal. The low- and high-pass branches are rescaled and combined to achieve an all-pass filter, so the high-frequency features are well protected in stacked self-attention layers. We further integrate the Spatial-Frequency Transformer into the Siamese tracking framework and propose a novel tracking algorithm, termed SFTransT. A cross-scale fusion based SwinTransformer is adopted as the backbone, and a multi-head cross-attention module is used to boost the interaction between search and template features. The output is fed into the tracking head for target localization. Extensive experiments on both short-term and long-term tracking benchmarks demonstrate the effectiveness of the proposed framework.
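The GPHA steps described above (inject a spatial prior into the Query-Key similarity, apply Softmax, split the result into a direct and a high-frequency component, then rescale and recombine) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. Several pieces are assumptions made for illustration: the prior here is a fixed-width Gaussian over grid distances rather than one predicted by dual MLPs, it is injected by simple addition to the similarity matrix, the direct component is taken to be the uniform averaging matrix, and the names `gaussian_prior`, `gpha_attention`, `low_scale`, and `high_scale` are hypothetical.

```python
import numpy as np


def gaussian_prior(h, w, sigma):
    """Pairwise Gaussian weights between all positions of an h x w grid.

    Assumption: a fixed sigma stands in for the MLP-predicted prior.
    """
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    coords = np.stack([ys.ravel(), xs.ravel()], axis=1).astype(np.float64)  # (n, 2)
    # Squared Euclidean distance between every pair of grid positions.
    d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(axis=-1)  # (n, n)
    return np.exp(-d2 / (2.0 * sigma ** 2))


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def gpha_attention(q, k, v, prior, low_scale=1.0, high_scale=2.0):
    """Single-head attention with a spatial prior and high-frequency emphasis.

    q, k, v: arrays of shape (n, d); prior: array of shape (n, n).
    Returns the output features, the attention matrix, and its
    high-frequency part.
    """
    n, d = q.shape
    sim = q @ k.T / np.sqrt(d)           # Query-Key similarity matrix
    attn = softmax(sim + prior)          # assumption: prior injected by addition
    direct = np.full((n, n), 1.0 / n)    # assumption: direct part = uniform averaging
    high = attn - direct                 # high-frequency remainder
    mixed = low_scale * direct + high_scale * high  # rescale and recombine
    return mixed @ v, attn, high


# Usage on random features laid out on a 4 x 4 grid.
rng = np.random.default_rng(0)
h, w, d = 4, 4, 8
q, k, v = (rng.standard_normal((h * w, d)) for _ in range(3))
out, attn, high = gpha_attention(q, k, v, gaussian_prior(h, w, sigma=1.5))
print(out.shape)
```

With `low_scale = high_scale = 1` the sketch reduces to ordinary softmax attention with an additive prior; raising `high_scale` above `low_scale` amplifies the part of the attention that deviates from plain averaging, which is the intuition behind protecting high-frequency features across stacked layers.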
