By exploiting the complementary sensor characteristics of audio, visible cameras, and thermal cameras, the robustness of person recognition can be enhanced. Existing multimodal person recognition frameworks are primarily formulated under the assumption that data from all modalities is always available. In this paper, we propose a novel trimodal sensor fusion framework using audio, visible, and thermal camera data that addresses the missing modality problem. Within the framework, a novel deep latent embedding network, termed AVTNet, is proposed to learn multiple latent embeddings. In addition, a novel loss function, termed the missing modality loss, accounts for possible missing modalities based on the triplet loss calculation while learning the individual latent embeddings. A joint latent embedding over the trimodal data is then learnt using a multi-head attention transformer, which assigns attention weights to the different modalities. The resulting latent embeddings are subsequently used to train a deep neural network. The proposed framework is validated on the Speaking Faces dataset. A comparative analysis with baseline algorithms shows that the proposed framework significantly improves person recognition accuracy while accounting for missing modalities.
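To make the missing modality loss concrete, the following is a minimal sketch, not the authors' implementation: it assumes each modality's triplet term is gated by a per-sample availability mask so that absent modalities contribute nothing to the objective. The function name, the mask convention, and the margin value are illustrative assumptions.

```python
# Sketch of a triplet-based loss that tolerates missing modalities.
# Assumption: a (B,) mask marks, per triplet, whether this modality is
# present for the anchor, positive, and negative samples alike.
import torch
import torch.nn.functional as F

def missing_modality_triplet_loss(anchor, positive, negative, mask, margin=0.3):
    """anchor/positive/negative: (B, D) embeddings for one modality.
    mask: (B,) float tensor, 1.0 where the modality is available, else 0.0.
    The margin of 0.3 is a hypothetical default, not the paper's value."""
    per_triplet = F.triplet_margin_loss(
        anchor, positive, negative, margin=margin, reduction="none"
    )
    # Zero out triplets whose modality is missing, then average only over
    # the triplets that actually contributed.
    masked = per_triplet * mask
    return masked.sum() / mask.sum().clamp(min=1.0)
```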
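Similarly, the joint latent embedding could be sketched as self-attention over the three per-modality embeddings treated as a length-3 token sequence; the attention weights then act as the modality weighting described above. The class name, embedding dimension, head count, and mean-pooling step are assumptions for illustration, not the paper's specification.

```python
# Sketch of trimodal fusion with multi-head attention (PyTorch).
import torch
import torch.nn as nn

class TrimodalAttentionFusion(nn.Module):
    def __init__(self, dim=256, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, audio_emb, visible_emb, thermal_emb):
        # Stack the per-modality embeddings into a (B, 3, D) sequence so
        # self-attention can weight each modality against the others.
        tokens = torch.stack([audio_emb, visible_emb, thermal_emb], dim=1)
        fused, attn_weights = self.attn(tokens, tokens, tokens)
        # Mean-pool the attended tokens into a single joint embedding.
        return fused.mean(dim=1), attn_weights
```

Pooling the attended tokens is one plausible way to obtain a single joint embedding for the downstream classifier; concatenation or a learned query token would be equally valid design choices.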