Transformer-based visual object tracking has been widely adopted. However, the Transformer structure lacks sufficient inductive bias. In addition, focusing only on encoding global features harms the modeling of local details, which restricts tracking capability on aerial robots. To address these issues, the proposed tracker adopts a local-modeling to global-search mechanism and replaces the global encoder with a novel local-recognition encoder. Within this encoder, a local-recognition attention and a local element correction network are designed to reduce interference from redundant global information and to increase local inductive bias. In addition, the local element correction network models local object details precisely under the aerial view through a detail-inquiry net. The proposed method achieves competitive accuracy and robustness on several authoritative aerial benchmarks with 316 sequences in total. The practicability and efficiency of the proposed tracker have been validated in real-world tests.
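The abstract does not specify how the local-recognition attention is computed. As a generic illustration only, the sketch below shows one common way to make attention local: each token attends only to neighbors within a fixed radius, so distant (global) tokens cannot influence its output. The function name `local_attention`, the `radius` parameter, and the 1-D token layout are assumptions made for this example and are not taken from the paper.

```python
import math


def local_attention(queries, keys, values, radius):
    """Scaled dot-product attention restricted to a local window.

    Hypothetical illustration: token i attends only to tokens j with
    |i - j| <= radius. This is not the paper's local-recognition attention.
    """
    n = len(queries)
    dim = len(queries[0])
    out_dim = len(values[0])
    outputs = []
    for i in range(n):
        # Indices of the tokens inside the local window around token i.
        window = range(max(0, i - radius), min(n, i + radius + 1))
        # Scaled dot-product scores against the keys in the window only.
        scores = [
            sum(q * k for q, k in zip(queries[i], keys[j])) / math.sqrt(dim)
            for j in window
        ]
        # Numerically stable softmax over the window.
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        # Weighted sum of the values inside the window.
        outputs.append([
            sum(w / total * values[j][d] for w, j in zip(weights, window))
            for d in range(out_dim)
        ])
    return outputs


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.5]]
local_out = local_attention(tokens, tokens, tokens, radius=1)
```

With `radius=0` each token attends only to itself, and with a radius at least as large as the sequence length the function reduces to ordinary global attention. Intermediate values trade global context for a stronger locality bias.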