Most existing Siamese-based tracking methods classify and regress the target object based on similarity maps. However, they either employ a single map from the last convolutional layer, which degrades localization accuracy in complex scenarios, or use multiple maps separately for decision making, which introduces a computational burden that is intractable for aerial mobile platforms. In this work, we therefore propose an efficient and effective hierarchical feature transformer (HiFT) for aerial tracking. Hierarchical similarity maps generated by multi-level convolutional layers are fed into the feature transformer to achieve interactive fusion of spatial cues (shallow layers) and semantic cues (deep layers). Consequently, not only is global contextual information enhanced, facilitating the target search, but our end-to-end architecture with the transformer can also efficiently learn the interdependencies among multi-level features, thereby discovering a tracking-tailored feature space with strong discriminability. Comprehensive evaluations on four aerial benchmarks demonstrate the effectiveness of HiFT. Real-world tests on an aerial platform validate its practicability at real-time speed. Our code is available at https://github.com/vision4robotics/HiFT.
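The idea of fusing similarity maps from shallow and deep layers through attention can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see the repository linked above for that). The function names `depthwise_xcorr`, `cross_attention`, and `fuse_levels` are hypothetical, the attention is single-head with no learned projections, and the sketch assumes that the similarity maps at both levels share the same shape. Which level supplies the queries is likewise an assumption made only for illustration.

```python
import numpy as np


def depthwise_xcorr(search, template):
    """Channel-wise cross-correlation of a template over a search feature map.

    search: (C, H, W), template: (C, h, w) -> similarity map (C, H-h+1, W-w+1).
    """
    C, H, W = search.shape
    _, h, w = template.shape
    out = np.empty((C, H - h + 1, W - w + 1))
    for i in range(H - h + 1):
        for j in range(W - w + 1):
            window = search[:, i:i + h, j:j + w]
            out[:, i, j] = (window * template).sum(axis=(1, 2))
    return out


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(query, key_value):
    """Single-head scaled dot-product attention on (N, C) token sequences."""
    d = query.shape[-1]
    weights = softmax(query @ key_value.T / np.sqrt(d), axis=-1)
    return weights @ key_value, weights


def fuse_levels(shallow_map, deep_map):
    """Let every location of the deep map attend to all shallow-map locations.

    Assumption: both maps have identical shape (C, H, W).
    """
    C, H, W = deep_map.shape
    q = deep_map.reshape(C, -1).T       # (H*W, C) tokens from the deep level
    kv = shallow_map.reshape(C, -1).T   # (H*W, C) tokens from the shallow level
    attended, weights = cross_attention(q, kv)
    fused = q + attended                # residual connection
    return fused.T.reshape(C, H, W), weights


# Toy data standing in for backbone features at two levels (random, illustrative only).
rng = np.random.default_rng(0)
levels = ["shallow", "deep"]
maps = {}
for name in levels:
    search_feat = rng.standard_normal((8, 12, 12))
    template_feat = rng.standard_normal((8, 4, 4))
    maps[name] = depthwise_xcorr(search_feat, template_feat)

fused, weights = fuse_levels(maps["shallow"], maps["deep"])
print(fused.shape, weights.shape)
```

Because each attention row spans every spatial location of the other level, each output location mixes information from the whole map, which is one simple way to picture the global context the abstract refers to. A trained tracker would add learned projections, multiple heads, and classification and regression heads on top of the fused features.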