A novel Face Pyramid Vision Transformer (FPVT) is proposed to learn discriminative multi-scale facial representations for face recognition and verification. In FPVT, Face Spatial Reduction Attention (FSRA) and Face Dimensionality Reduction (FDR) layers are employed to make the feature maps compact, thus reducing computation. An Improved Patch Embedding (IPE) algorithm is proposed to exploit the benefits of CNNs in ViTs (e.g., shared weights, local context, and receptive fields) and to model features ranging from lower-level edges to higher-level semantic primitives. Within the FPVT framework, a Convolutional Feed-Forward Network (CFFN) is proposed that extracts locality information to learn low-level facial information. The proposed FPVT is evaluated on seven benchmark datasets and compared with ten existing state-of-the-art methods, including CNNs, pure ViTs, and Convolutional ViTs. Despite having fewer parameters, FPVT demonstrates excellent performance relative to the compared methods. The project page is available at https://khawar-islam.github.io/fpvt/
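The abstract does not state how FSRA makes the feature maps compact. As a rough illustration only, the sketch below assumes a spatial reduction scheme of the kind used in pyramid vision transformers, where keys and values are downsampled by a ratio R along each spatial side while queries keep full resolution. The function name `attention_pairs`, the `reduction` parameter, and the 28 x 28 grid are hypothetical choices for this example, not values taken from the paper.

```python
def attention_pairs(height, width, reduction=1):
    """Count query-key pairs for one attention head over a height x width token grid.

    `reduction` is a hypothetical spatial reduction ratio R (an assumption, not
    a value from the paper): keys and values are downsampled from height * width
    tokens to (height // R) * (width // R) tokens, while queries stay at full
    resolution.
    """
    if reduction < 1 or height % reduction or width % reduction:
        raise ValueError("reduction must be a positive divisor of height and width")
    num_queries = height * width
    num_keys = (height // reduction) * (width // reduction)
    return num_queries * num_keys


# Illustrative grid size only; the actual FPVT stage resolutions may differ.
full = attention_pairs(28, 28)
reduced = attention_pairs(28, 28, reduction=4)
print(full, reduced, full // reduced)
```

Under this assumption, the number of query-key pairs, and with it the cost of the attention map, shrinks by a factor of R squared, which is one plausible reading of how a spatial reduction layer lowers computation.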