Husformer: A Multi-Modal Transformer for Multi-Modal Human State Recognition

Human state recognition is a critical topic with pervasive and important applications in human-machine systems. Multi-modal fusion, the combination of metrics from multiple data sources, has been shown to be a sound method for improving recognition performance. However, while recent multi-modal models have reported promising results, they generally fail to leverage sophisticated fusion strategies that would model sufficient cross-modal interactions when producing the fusion representation; instead, they rely on lengthy and inconsistent data preprocessing and feature crafting. To address this limitation, we propose Husformer, an end-to-end multi-modal transformer framework for multi-modal human state recognition. Specifically, we use cross-modal transformers, which let one modality reinforce itself by directly attending to latent relevance revealed in other modalities, to fuse different modalities while ensuring sufficient awareness of the cross-modal interactions introduced. Subsequently, we use a self-attention transformer to further prioritize contextual information in the fusion representation. Using these two attention mechanisms enables effective and adaptive adjustment to noise and interruptions in multi-modal signals, both during the fusion process and in relation to high-level features. Extensive experiments on two human emotion corpora (DEAP and WESAD) and two cognitive workload datasets (MOCAS and CogLoad) demonstrate that Husformer outperforms both state-of-the-art multi-modal baselines and the use of a single modality by a large margin in human state recognition, especially when dealing with raw multi-modal signals. We also conduct an ablation study to show the benefits of each component in Husformer.
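The two attention stages described in the abstract can be illustrated with a minimal NumPy sketch. This is not the Husformer implementation: it omits learned projection weights, multiple heads, and every other detail of a full transformer layer, and the modality names (`eeg`, `gsr`), sequence lengths, and feature size are hypothetical values chosen only for illustration. It shows the core idea that in cross-modal attention the queries come from one modality while the keys and values come from another, after which self-attention is applied to the concatenated result.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = queries.shape[-1]
    weights = softmax(queries @ keys.T / np.sqrt(d), axis=-1)
    return weights @ values, weights


# Hypothetical inputs: two modalities with different sequence lengths
# but a shared feature size (all values here are assumptions).
rng = np.random.default_rng(0)
d = 8
eeg = rng.normal(size=(10, d))  # target modality, 10 time steps
gsr = rng.normal(size=(6, d))   # source modality, 6 time steps

# Stage 1, cross-modal attention: each modality queries the other one,
# so its output is built from the other modality's values.
eeg_from_gsr, cross_weights = attention(eeg, gsr, gsr)
gsr_from_eeg, _ = attention(gsr, eeg, eeg)

# Stage 2, self-attention over the fused sequence: every position can
# attend to every other position of the concatenated representation.
fused = np.concatenate([eeg_from_gsr, gsr_from_eeg], axis=0)
fused_out, self_weights = attention(fused, fused, fused)

print(cross_weights.shape)  # (10, 6)
print(fused_out.shape)      # (16, 8)
```

Each row of `cross_weights` is a probability distribution over the source modality's time steps, which is what allows the target modality to weight the source's content adaptively, for example by giving little weight to noisy or interrupted segments.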
