We present a novel method for learning facial landmark localization across multiple image domains and multiple landmark definitions, targeted at small datasets. Training a small dataset jointly with a larger one yields more robust learning for the former, and provides a universal mechanism for facial landmark localization on new and/or smaller standard datasets. To this end, we propose a Vision Transformer encoder paired with a novel decoder that carries a definition-agnostic, shared landmark-semantic group-structured prior, which is learnt as we train on more than one dataset concurrently. Thanks to this definition-agnostic group prior, the datasets may differ in both landmark definitions and image domains. In the decoder stage we use cross- and self-attention, whose output is then fed into domain/definition-specific heads that minimize a Laplacian log-likelihood loss. When trained alongside a larger dataset, we achieve state-of-the-art performance on standard landmark localization datasets such as COFW and WFLW. We also show state-of-the-art performance on several small datasets from varied image domains covering animals, caricatures, and facial portrait paintings. Further, we contribute a small dataset (150 images) of pareidolias to demonstrate the efficacy of our method. Finally, we provide several analyses and ablation studies to justify our claims.
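The domain/definition-specific heads minimize a Laplacian log-likelihood loss. As a minimal sketch (the function names and the per-coordinate parameterization here are assumptions, not the paper's exact formulation), the negative log-likelihood of a Laplace distribution with predicted location mu and scale b could look like:

```python
import math

def laplace_nll(mu, b, y):
    """Negative log-likelihood of a Laplace distribution with
    location mu and scale b > 0, evaluated at target coordinate y:
        -log p(y | mu, b) = log(2b) + |y - mu| / b
    Predicting b alongside mu lets a head express per-landmark
    uncertainty; minimizing the NLL trades off coordinate error
    against that predicted scale."""
    return math.log(2.0 * b) + abs(y - mu) / b

def landmark_loss(pred_xy, pred_scales, target_xy):
    """Loss for one 2D landmark: sum the NLL over x and y coordinates."""
    return sum(laplace_nll(m, b, t)
               for m, b, t in zip(pred_xy, pred_scales, target_xy))
```

Compared with an L2 loss, the |y - mu| term corresponds to a Laplace (rather than Gaussian) noise model, which is less sensitive to annotation outliers.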