Remote photoplethysmography (rPPG), which measures cardiac activity and physiological signals from facial video without any contact, has great potential in many applications (e.g., remote healthcare and affective computing). Recent deep learning approaches focus on mining subtle rPPG clues using convolutional neural networks with limited spatio-temporal receptive fields, and thus neglect long-range spatio-temporal perception and interaction in rPPG modeling. In this paper, we propose PhysFormer, an end-to-end video transformer-based architecture that adaptively aggregates both local and global spatio-temporal features to enhance the rPPG representation. As the key modules in PhysFormer, the temporal difference transformers first enhance the quasi-periodic rPPG features with temporal-difference-guided global attention, and then refine the local spatio-temporal representation against interference. Furthermore, we propose label distribution learning and a curriculum-learning-inspired dynamic constraint in the frequency domain, which provide elaborate supervision for PhysFormer and alleviate overfitting. Comprehensive experiments on four benchmark datasets show superior performance in both intra- and cross-dataset testing. One highlight is that, unlike most transformer networks, which need pretraining on large-scale datasets, the proposed PhysFormer can be easily trained from scratch on rPPG datasets, which makes it promising as a novel transformer baseline for the rPPG community. The code will be released at https://github.com/ZitongYu/PhysFormer.
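To make the idea of label distribution learning concrete, the following is a minimal sketch, not the PhysFormer implementation. It assumes the scalar heart-rate label is softened into a discrete Gaussian over integer bpm bins and compared to a predicted distribution with a KL divergence. The function names, the 40 to 180 bpm bin range, and the value of `sigma` are all illustrative assumptions that the abstract does not specify.

```python
import math


def hr_label_distribution(hr_bpm, lo=40, hi=180, sigma=1.0):
    """Soften a scalar heart rate into a discrete Gaussian over bpm bins.

    The bin range [lo, hi] and sigma are assumed values for illustration.
    """
    bins = list(range(lo, hi + 1))
    weights = [math.exp(-((b - hr_bpm) ** 2) / (2 * sigma ** 2)) for b in bins]
    total = sum(weights)
    return bins, [w / total for w in weights]


def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted) for two distributions over the same bins."""
    return sum(
        t * math.log((t + eps) / (p + eps))
        for t, p in zip(target, predicted)
        if t > 0  # bins with zero target mass contribute nothing
    )


# Target distribution for a ground-truth heart rate of 72 bpm.
bins, target = hr_label_distribution(72.0)

# Two hypothetical predictions: one close to the label, one far from it.
_, close = hr_label_distribution(74.0)
_, far = hr_label_distribution(90.0)

loss_close = kl_divergence(target, close)
loss_far = kl_divergence(target, far)
```

Under these assumptions, a prediction whose mass sits near the true heart rate incurs a smaller loss than one far from it, and neighboring bins still receive some supervision instead of being treated as equally wrong.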