Holistic methods using CNNs and margin-based losses have dominated research on face recognition. In this work, we depart from this setting in two ways: (a) we employ the Vision Transformer as the architecture for training a very strong face recognition baseline, simply called fViT, which already surpasses most state-of-the-art face recognition methods; (b) we capitalize on the Transformer's inherent ability to process information (visual tokens) extracted from irregular grids to devise a face recognition pipeline reminiscent of part-based face recognition methods. Our pipeline, called part fViT, comprises a lightweight network that predicts the coordinates of facial landmarks, followed by a Vision Transformer operating on patches extracted at the predicted landmarks; it is trained end-to-end with no landmark supervision. By learning to extract discriminative patches, our part-based Transformer further boosts the accuracy of our Vision Transformer baseline, achieving state-of-the-art accuracy on several face recognition benchmarks.
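To make the idea of tokens taken from an irregular grid concrete, the following is a minimal sketch of extracting one patch per landmark at real-valued coordinates. It is an illustration under stated assumptions, not the part fViT implementation: the function names (`bilinear`, `extract_patch_tokens`), the use of bilinear interpolation, the patch size, and the hard-coded landmark coordinates are all hypothetical. In the pipeline described above, the coordinates would come from the lightweight landmark network and the sampling would need to be differentiable so that the whole model can be trained end-to-end.

```python
import math

def bilinear(image, x, y):
    """Sample a 2D grayscale image (list of rows) at a real-valued (x, y)."""
    h, w = len(image), len(image[0])
    # Clamp so that samples outside the image reuse the border pixels.
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    top = (1 - ax) * image[y0][x0] + ax * image[y0][x1]
    bottom = (1 - ax) * image[y1][x0] + ax * image[y1][x1]
    return (1 - ay) * top + ay * bottom

def extract_patch_tokens(image, landmarks, patch_size=3):
    """Return one flattened patch (a visual token) per landmark (x, y).

    Patches are centered on the landmark, so they need not align with
    the regular grid that a standard Vision Transformer would use.
    """
    half = (patch_size - 1) / 2.0
    tokens = []
    for cx, cy in landmarks:
        token = [
            bilinear(image, cx + dx - half, cy + dy - half)
            for dy in range(patch_size)
            for dx in range(patch_size)
        ]
        tokens.append(token)
    return tokens

# Toy 8x8 image whose pixel value is 10 * row + column.
image = [[10.0 * r + c for c in range(8)] for r in range(8)]

# Hypothetical landmark predictions (x, y); the second is off-grid.
landmarks = [(2.0, 3.0), (4.5, 1.25)]

tokens = extract_patch_tokens(image, landmarks)
```

Each entry of `tokens` would then be linearly projected and passed to the Transformer in place of a regular-grid patch embedding. Sampling at sub-pixel positions is one plausible way to keep the patch content a smooth function of the predicted coordinates.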