Top-performing landmark estimation algorithms are based on exploiting the excellent ability of large convolutional neural networks (CNNs) to represent local appearance. However, it is well known that they can only learn weak spatial relationships. To address this problem, we propose a model based on the combination of a CNN with a cascade of Graph Attention Network regressors. To this end, we introduce an encoding that jointly represents the appearance and location of facial landmarks and an attention mechanism to weigh the information according to its reliability. This is combined with a multi-task approach to initialize the location of graph nodes and a coarse-to-fine landmark description scheme. Our experiments confirm that the proposed model learns a global representation of the structure of the face, achieving top performance in popular benchmarks on head pose and landmark estimation. The improvement provided by our model is most significant in situations involving large changes in the local appearance of landmarks.
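To make the idea of attention-weighted message passing between landmark nodes concrete, the following is a minimal sketch of a single standard graph attention layer in plain Python. It is not the authors' implementation: the node encoding (appearance features concatenated with an (x, y) location), the fully connected graph, the random weights, and all names (`encode_node`, `gat_layer`, `W`, `a`) are assumptions made for illustration only.

```python
import math
import random


def encode_node(appearance, location):
    # Hypothetical joint encoding: appearance features followed by (x, y).
    return list(appearance) + list(location)


def matvec(W, h):
    # Multiply matrix W (list of rows) by vector h.
    return [sum(w * x for w, x in zip(row, h)) for row in W]


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def softmax(scores):
    # Numerically stable softmax over a list of scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gat_layer(nodes, neighbors, W, a):
    """One single-head graph attention layer.

    nodes:     list of input feature vectors, one per landmark
    neighbors: neighbors[i] is the list of node indices that node i attends to
    W:         projection matrix (out_dim rows, in_dim columns)
    a:         attention vector of length 2 * out_dim
    Returns the updated node features and the attention weights per node.
    """
    projected = [matvec(W, h) for h in nodes]
    out_dim = len(W)
    outputs, attention = [], []
    for i, nbrs in enumerate(neighbors):
        # Score each neighbor from the concatenated pair of projected features.
        scores = [
            leaky_relu(sum(c * v for c, v in zip(a, projected[i] + projected[j])))
            for j in nbrs
        ]
        alpha = softmax(scores)
        # Aggregate neighbor features, weighted by their attention.
        out = [
            sum(al * projected[j][k] for al, j in zip(alpha, nbrs))
            for k in range(out_dim)
        ]
        outputs.append(out)
        attention.append(alpha)
    return outputs, attention


# Toy example: 4 landmarks, 3 appearance features each, fully connected graph.
rng = random.Random(0)
num_landmarks, app_dim, out_dim = 4, 3, 2
nodes = [
    encode_node([rng.uniform(-1, 1) for _ in range(app_dim)],
                (rng.random(), rng.random()))
    for _ in range(num_landmarks)
]
neighbors = [list(range(num_landmarks)) for _ in range(num_landmarks)]
W = [[rng.uniform(-0.5, 0.5) for _ in range(app_dim + 2)] for _ in range(out_dim)]
a = [rng.uniform(-0.5, 0.5) for _ in range(2 * out_dim)]

outputs, attention = gat_layer(nodes, neighbors, W, a)
```

In this sketch each row of `attention` sums to one, so a landmark whose pairing receives a low score contributes little to its neighbors' updated features. In a trained model, `W` and `a` would be learned rather than sampled at random, and a cascade would apply several such stages in sequence.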