This paper presents a novel Transformer-based facial landmark localization network named Localization Transformer (LOTR). The proposed framework is a direct coordinate regression approach leveraging a Transformer network to better utilize the spatial information in the feature map. An LOTR model consists of three main modules: 1) a visual backbone that converts an input image into a feature map, 2) a Transformer module that improves the feature representation from the visual backbone, and 3) a landmark prediction head that directly predicts the landmark coordinates from the Transformer's representation. Given cropped-and-aligned face images, the proposed LOTR can be trained end-to-end without requiring any post-processing steps. This paper also introduces the smooth-Wing loss function, which addresses the gradient discontinuity of the Wing loss, leading to better convergence than standard loss functions such as L1, L2, and Wing loss. Experimental results on the JD landmark dataset provided by the First Grand Challenge of 106-Point Facial Landmark Localization indicate the superiority of LOTR over the existing methods on the leaderboard and two recent heatmap-based approaches. On the WFLW dataset, the proposed LOTR framework demonstrates promising results compared with several state-of-the-art methods. Additionally, we report the improvement in state-of-the-art face recognition performance when using our proposed LOTRs for face alignment.
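To illustrate the gradient discontinuity that motivates the smooth-Wing loss, the following is a minimal sketch and not the paper's exact formulation. It implements the standard Wing loss, whose slope jumps from -w/ε to +w/ε at zero error, and one possible smoothed variant that replaces the logarithm with a quadratic inside a small threshold `t`. The function name `smooth_wing_sketch`, the threshold `t`, its default value, and the constants `s`, `c_t`, and `c_w` are assumptions introduced here for illustration. The defaults `w = 10` and `eps = 2` are common choices for the Wing loss.

```python
import math


def wing_loss(x, w=10.0, eps=2.0):
    """Standard Wing loss for a single coordinate residual x."""
    ax = abs(x)
    if ax < w:
        # Logarithmic region: amplifies small and medium errors.
        return w * math.log(1.0 + ax / eps)
    # Linear region, shifted so the two pieces meet at |x| = w.
    return ax - (w - w * math.log(1.0 + w / eps))


def smooth_wing_sketch(x, w=10.0, eps=2.0, t=0.01):
    """Illustrative smoothed Wing loss (hypothetical form, not the paper's).

    For |x| < t the loss is quadratic, so the gradient passes through
    zero continuously instead of jumping between -w/eps and +w/eps.
    """
    ax = abs(x)
    # s matches the slope of the log region at |x| = t: 2*s*t = w / (eps + t).
    s = w / (2.0 * t * (eps + t))
    # c_t shifts the log region so the value is continuous at |x| = t.
    c_t = w * math.log(1.0 + t / eps) - s * t * t
    if ax < t:
        return s * ax * ax
    if ax < w:
        return w * math.log(1.0 + ax / eps) - c_t
    # c_w keeps the value continuous at |x| = w; the slope there is left
    # as in the standard Wing loss.
    c_w = w - w * math.log(1.0 + w / eps) + c_t
    return ax - c_w


def one_sided_slopes(f, h=1e-6):
    """Finite-difference slopes of f just left and just right of zero."""
    left = (f(0.0) - f(-h)) / h
    right = (f(h) - f(0.0)) / h
    return left, right


if __name__ == "__main__":
    print("Wing slopes at 0:       ", one_sided_slopes(wing_loss))
    print("Smoothed slopes at 0:   ", one_sided_slopes(smooth_wing_sketch))
```

In this sketch, the one-sided slopes of the Wing loss at zero differ by roughly 2w/ε, while those of the smoothed variant nearly coincide, which is the property the abstract credits for better convergence.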