This paper presents a novel Transformer-based facial landmark localization network named Localization Transformer (LOTR). The proposed framework is a direct coordinate regression approach that leverages a Transformer network to better utilize the spatial information in the feature map. An LOTR model consists of three main modules: 1) a visual backbone that converts an input image into a feature map, 2) a Transformer module that improves the feature representation from the visual backbone, and 3) a landmark prediction head that directly regresses the landmark coordinates from the Transformer's representation. Given cropped-and-aligned face images, the proposed LOTR can be trained end-to-end without requiring any post-processing steps. This paper also introduces the smooth-Wing loss function, which addresses the gradient discontinuity of the Wing loss and leads to better convergence than standard loss functions such as the L1, L2, and Wing losses. Experimental results on the JD landmark dataset, provided by the First Grand Challenge of 106-Point Facial Landmark Localization, show that LOTR outperforms both the existing methods on the challenge leaderboard and two recent heatmap-based approaches.
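To make the three-module pipeline concrete, the following is a minimal PyTorch sketch of a LOTR-style direct coordinate regressor. The ResNet-50 backbone, the DETR-style learned landmark queries, the omission of positional encodings, and all hyperparameters (`d_model`, `nhead`, layer counts) are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn
import torchvision


class LOTRSketch(nn.Module):
    """Rough LOTR-style model: backbone -> Transformer -> coordinate head.

    A sketch under assumed hyperparameters, not the paper's exact architecture.
    """

    def __init__(self, num_landmarks=106, d_model=256, nhead=8, num_layers=4):
        super().__init__()
        # 1) Visual backbone: input image -> spatial feature map.
        resnet = torchvision.models.resnet50(weights=None)
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])  # drop avgpool/fc
        self.proj = nn.Conv2d(2048, d_model, kernel_size=1)           # channel reduction

        # 2) Transformer module: refine the backbone features; one learned
        #    query token per landmark (positional encodings omitted for brevity).
        self.transformer = nn.Transformer(d_model=d_model, nhead=nhead,
                                          num_encoder_layers=num_layers,
                                          num_decoder_layers=num_layers,
                                          batch_first=True)
        self.queries = nn.Embedding(num_landmarks, d_model)

        # 3) Landmark prediction head: each output token -> one (x, y) pair,
        #    regressed directly with no heatmap or post-processing step.
        self.head = nn.Linear(d_model, 2)

    def forward(self, images):                         # images: (B, 3, H, W)
        feat = self.proj(self.backbone(images))        # (B, d_model, h, w)
        tokens = feat.flatten(2).transpose(1, 2)       # (B, h*w, d_model)
        q = self.queries.weight.unsqueeze(0).expand(images.size(0), -1, -1)
        out = self.transformer(src=tokens, tgt=q)      # (B, num_landmarks, d_model)
        return self.head(out)                          # (B, num_landmarks, 2)


model = LOTRSketch()
coords = model(torch.randn(1, 3, 256, 256))            # torch.Size([1, 106, 2])
```

Because the head emits coordinates directly, the whole stack is trainable end-to-end with a single regression loss, which is what allows LOTR to skip the decoding step that heatmap-based methods require.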
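The abstract describes the smooth-Wing loss only at a high level. The Wing loss uses w·ln(1 + |x|/ε) for |x| < w and a shifted linear branch elsewhere, so its gradient jumps from -w/ε to +w/ε at x = 0. Below is one plausible construction of the smoothing idea, assuming a quadratic inner region of half-width t whose value and slope match the log branch at |x| = t; the threshold t and the constants c1, c2 are this sketch's own choices, and the paper's published formulation may differ (for example, it may also smooth the junction at |x| = w, which this sketch leaves as in the original Wing loss).

```python
import math

import torch


def smooth_wing_loss(pred, target, w=10.0, eps=2.0, t=0.5):
    """Illustrative smooth-Wing: quadratic near zero, Wing-style log branch in
    the middle, linear tail. Constants are chosen so both the value and the
    gradient are continuous at |x| = t, removing the Wing loss's jump at zero.
    This is a hedged reconstruction, not the paper's exact definition.
    """
    x = (pred - target).abs()
    # Quadratic coefficient chosen so 2*a*t == w/(eps + t), the log branch's
    # slope at |x| = t (gradient continuity at the inner junction).
    a = w / (2.0 * t * (eps + t))
    # Offsets that keep the piecewise function value-continuous.
    c1 = w * math.log1p(t / eps) - a * t ** 2    # aligns log branch with quadratic
    c2 = w - w * math.log1p(w / eps) + c1        # aligns linear tail with log branch
    loss = torch.where(
        x < t,
        a * x ** 2,
        torch.where(x < w, w * torch.log1p(x / eps) - c1, x - c2),
    )
    return loss.mean()


# Usage on dummy 106-point predictions:
loss = smooth_wing_loss(torch.randn(1, 106, 2), torch.randn(1, 106, 2))
```

The quadratic region gives small residuals a gradient that shrinks smoothly to zero instead of flipping sign at the origin, which is the convergence benefit the abstract attributes to smooth-Wing over L1, L2, and the original Wing loss.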