Transformers are outperforming convolutional architectures on many vision tasks, such as image classification and object detection. However, the lack of an explicit alignment mechanism limits their capability in person re-identification (re-ID), where misalignment caused by pose and viewpoint variations is inevitable. Moreover, in our experiments the alignment paradigm of convolutional neural networks does not perform well when applied to Transformers. To address this problem, we develop a novel alignment framework for Transformers that adds learnable vectors, called "part tokens", to learn part representations and integrates part alignment into self-attention. Each part token interacts only with a subset of patch embeddings and learns to represent that subset. Based on this framework, we design an online Auto-Aligned Transformer (AAformer) that adaptively assigns patch embeddings of the same semantics to the same part token at run time. The part tokens can be regarded as part prototypes, and a fast variant of the Sinkhorn-Knopp algorithm is employed to cluster the patch embeddings to the part tokens online. AAformer can be viewed as a new principled formulation for simultaneously learning both part alignment and part representations. Extensive experiments validate the effectiveness of part tokens and the superiority of AAformer over various state-of-the-art CNN-based methods. Our code will be released.
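The online clustering step described in the abstract can be illustrated with a minimal sketch. It uses the standard Sinkhorn-Knopp iteration with an equal-partition constraint, not the "fast variant" the abstract mentions, whose details are not given here. The function names (`cosine_scores`, `sinkhorn_assign`, `assign_patches_to_parts`), the cosine-similarity scoring, the temperature `eps`, and the iteration count are all assumptions made for illustration.

```python
import numpy as np


def cosine_scores(patches, part_tokens):
    """Cosine similarity between N patch embeddings and K part tokens -> (N, K)."""
    p = patches / np.linalg.norm(patches, axis=1, keepdims=True)
    t = part_tokens / np.linalg.norm(part_tokens, axis=1, keepdims=True)
    return p @ t.T


def sinkhorn_assign(scores, n_iters=50, eps=0.1):
    """Balanced soft assignment of patches (rows) to part tokens (columns).

    Standard Sinkhorn-Knopp: alternately rescale columns and rows so that
    every row sums to 1 and every column sums to roughly N / K.
    `eps` is an assumed temperature; smaller values give sharper assignments.
    """
    n, k = scores.shape
    # Subtract the global max before exponentiating for numerical stability.
    q = np.exp((scores - scores.max()) / eps)
    q /= q.sum()
    for _ in range(n_iters):
        q /= q.sum(axis=0, keepdims=True)  # each column sums to 1
        q /= k                             # each column carries mass 1/K
        q /= q.sum(axis=1, keepdims=True)  # each row sums to 1
        q /= n                             # each row carries mass 1/N
    return q * n  # rescale so each row is a distribution over part tokens


def assign_patches_to_parts(patches, part_tokens, n_iters=50, eps=0.1):
    """Hard part label per patch: the part token with the largest soft assignment."""
    q = sinkhorn_assign(cosine_scores(patches, part_tokens), n_iters, eps)
    return q.argmax(axis=1)
```

In a part-token Transformer, the resulting labels could be used to build an attention mask so that each part token attends only to the patches assigned to it. That wiring is left out here because the abstract does not specify it.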