Deploying emotion recognition systems in real-world environments where devices must be small, low-power, and private remains a significant challenge. This is especially relevant for applications such as tension monitoring, conflict de-escalation, and responsive wearables, where cloud-based solutions are impractical. Multimodal emotion recognition has advanced through deep learning, but most systems remain unsuitable for deployment on ultra-constrained edge devices: prior work typically relies on powerful hardware, lacks real-time performance, or uses unimodal input. This paper addresses that gap by presenting a hardware-aware emotion recognition system that combines acoustic and linguistic features in a late-fusion architecture optimised for the Edge TPU. The design integrates a quantised transformer-based acoustic model with frozen keyword embeddings from a DSResNet-SE network, enabling real-time inference within a 1.8 MB memory budget at 21-23 ms latency. The pipeline ensures spectrogram alignment between training and deployment using MicroFrontend and MLTK. Evaluation on re-recorded, segmented IEMOCAP samples captured through the Coral Dev Board Micro microphone shows a 6.3% macro-F1 improvement over unimodal baselines. This work demonstrates that accurate, real-time multimodal emotion inference is achievable on microcontroller-class edge platforms through task-specific fusion and hardware-guided model design.
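To make the late-fusion and quantisation pipeline concrete, the sketch below shows a minimal Keras fusion head over the two modality embeddings, followed by full-integer (int8) post-training quantisation of the kind required for Edge TPU compilation. All names, layer sizes, the 4-class label count, and the random calibration data are illustrative assumptions, not the paper's exact architecture; the transformer acoustic model and the frozen DSResNet-SE keyword extractor are represented here only by their output embeddings.

```python
# Minimal late-fusion sketch, assuming illustrative embedding sizes and a
# 4-class IEMOCAP setup; not the paper's exact model definition.
import numpy as np
import tensorflow as tf

NUM_EMOTIONS = 4      # assumed 4-class IEMOCAP label set
ACOUSTIC_DIM = 64     # illustrative acoustic embedding size
KEYWORD_DIM = 32      # illustrative frozen keyword-embedding size

# Two modality inputs: an utterance-level acoustic embedding and a frozen
# keyword embedding, fused late by concatenation before a small classifier.
acoustic_in = tf.keras.Input(shape=(ACOUSTIC_DIM,), name="acoustic_embedding")
keyword_in = tf.keras.Input(shape=(KEYWORD_DIM,), name="keyword_embedding")
fused = tf.keras.layers.Concatenate()([acoustic_in, keyword_in])
hidden = tf.keras.layers.Dense(64, activation="relu")(fused)
hidden = tf.keras.layers.Dropout(0.2)(hidden)
probs = tf.keras.layers.Dense(NUM_EMOTIONS, activation="softmax")(hidden)
fusion_head = tf.keras.Model([acoustic_in, keyword_in], probs)

# Full-integer post-training quantisation so the graph can be compiled for
# the Edge TPU; random arrays stand in for real calibration embeddings.
def representative_dataset():
    for _ in range(100):
        yield [
            np.random.rand(1, ACOUSTIC_DIM).astype(np.float32),
            np.random.rand(1, KEYWORD_DIM).astype(np.float32),
        ]

converter = tf.lite.TFLiteConverter.from_keras_model(fusion_head)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

with open("fusion_head_int8.tflite", "wb") as f:
    f.write(converter.convert())
```

In a typical Edge TPU workflow, the resulting int8 model would then be passed through the Edge TPU compiler before deployment on the Coral Dev Board Micro; the on-device MicroFrontend/MLTK spectrogram pipeline feeding the acoustic branch is outside the scope of this sketch.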