Generative psychological analysis of in-the-wild conversations faces two fundamental challenges: (1) existing Vision-Language Models (VLMs) fail to resolve Articulatory-Affective Ambiguity, where the visual patterns of speech articulation mimic emotional expressions; and (2) progress is stifled by the lack of verifiable evaluation metrics capable of assessing visual grounding and reasoning depth. We propose a complete ecosystem to address these twin challenges. First, we introduce the Multilevel Insight Network for Disentanglement (MIND), a novel hierarchical visual encoder whose Status Judgment module algorithmically suppresses ambiguous lip features based on their temporal feature variance, achieving explicit visual disentanglement. Second, we construct ConvoInsight-DB, a new large-scale dataset with expert annotations for micro-expressions and deep psychological inference. Third, we design the Mental Reasoning Insight Rating Metric (PRISM), an automated dimensional framework that uses an expert-guided LLM to measure the multidimensional performance of large vision models for psychological analysis. On our PRISM benchmark, MIND significantly outperforms all baselines, achieving a +86.95% gain in micro-expression detection over the prior SOTA. Ablation studies confirm that the Status Judgment disentanglement module is the most critical component of this performance leap. Our code is publicly available.
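The abstract describes the Status Judgment module only at a high level: lip features are suppressed when their temporal variance marks them as articulatory rather than affective. The following is a minimal, hypothetical sketch of such variance-based gating; the function name, threshold value, and the assumption that high temporal variance indicates speech-driven motion are illustrative choices, not the paper's actual implementation.

```python
import numpy as np

def status_judgment(lip_feats, var_threshold=0.5):
    """Gate lip-region features by their temporal variance.

    lip_feats: (T, D) array of per-frame lip features over T frames.
    Channels whose variance across time exceeds the threshold are
    treated as articulatory (speech-driven) and zeroed out, keeping
    only the more stable, putatively affective channels.
    """
    var = lip_feats.var(axis=0)                            # (D,) per-channel temporal variance
    keep = (var <= var_threshold).astype(lip_feats.dtype)  # 1 = keep, 0 = suppress
    return lip_feats * keep                                # suppress ambiguous channels
```

In a full model this hard mask would more plausibly be a learned soft gate, but the sketch captures the stated idea of disentangling articulation from affect via temporal statistics.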