DEFT-LLM: Disentangled Expert Feature Tuning for Micro-Expression Recognition

Micro-expression recognition (MER) is crucial for inferring genuine emotion. Applying a multimodal large language model (MLLM) to this task enables spatio-temporal analysis of facial motion and provides interpretable descriptions. However, two core challenges remain: (1) the entanglement of static appearance and dynamic motion cues prevents the model from focusing on subtle motion; (2) textual labels in existing MER datasets do not fully correspond to the underlying facial muscle movements, creating a semantic gap between text supervision and physical motion. To address these issues, we propose DEFT-LLM, which achieves motion-semantic alignment through multi-expert disentanglement. We first introduce Uni-MER, a motion-driven instruction dataset designed to align text with local facial motion. Its construction leverages dual constraints from optical flow and Action Unit (AU) labels to ensure spatio-temporal consistency and a reasonable correspondence to the underlying movements. We then design an architecture with three experts that decouples facial dynamics into independent and interpretable representations (structure, dynamic textures, and motion semantics). By integrating the instruction-aligned knowledge from Uni-MER into DEFT-LLM, our method injects effective physical priors for micro-expressions while also leveraging the cross-modal reasoning ability of large language models, thus enabling precise capture of subtle emotional cues. Experiments on multiple challenging MER benchmarks demonstrate state-of-the-art performance, as well as a particular advantage in interpretable modeling of local facial motion.
