Hate speech in online videos poses a growing threat to digital platforms, especially as video content becomes increasingly multimodal and context-dependent. Existing methods often struggle to fuse the complex semantic relationships between modalities effectively and lack the capacity to understand nuanced hateful content. To address these issues, we propose Reasoning-Aware Multimodal Fusion (RAMF), a novel framework. For the first challenge, we design Local-Global Context Fusion (LGCF) to capture both local salient cues and global temporal structure, and propose Semantic Cross Attention (SCA) to enable fine-grained multimodal semantic interaction. For the second challenge, we introduce adversarial reasoning, a structured three-stage process in which a vision-language model generates (i) objective descriptions, (ii) hate-assumed inferences, and (iii) non-hate-assumed inferences, providing complementary semantic perspectives that enrich the model's contextual understanding of nuanced hateful intent. Evaluations on two real-world hateful-video datasets show that our method generalises robustly, improving on state-of-the-art methods by 3% in Macro-F1 and 7% in hate-class recall. We will release the code after the anonymity period ends.
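To make the SCA component concrete, the following is a minimal sketch, not the authors' released implementation, of how fine-grained cross-modal semantic interaction could be realised: text tokens attend over video-frame features so fusion happens at token granularity. All dimensions, module names, and the residual-fusion choice here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SemanticCrossAttention(nn.Module):
    """Hypothetical SCA block: one modality queries the other."""
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        # Queries come from text; keys/values come from video frames.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_tokens, video_tokens):
        # text_tokens: (B, L_text, dim), video_tokens: (B, L_video, dim)
        fused, _ = self.attn(query=text_tokens,
                             key=video_tokens,
                             value=video_tokens)
        # Residual connection keeps the original text semantics intact.
        return self.norm(text_tokens + fused)
```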
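The three-stage adversarial reasoning process can likewise be sketched as a prompting loop. The prompt wording and the `query_vlm` helper below are hypothetical; the abstract specifies only the three stages, not the exact prompts or the vision-language model used.

```python
# Assumed interface: query_vlm(video, prompt) -> str, for any vision-language model.
STAGE_PROMPTS = {
    "objective": "Describe objectively what happens in this video, without judgement.",
    "hate_assumed": "Assume the video is hateful; explain which cues support that reading.",
    "non_hate_assumed": "Assume the video is not hateful; explain which cues support that reading.",
}

def adversarial_reasoning(video, query_vlm):
    """Collect three complementary semantic perspectives on one video.

    The returned texts can then be encoded and fused with the video
    features downstream (e.g. via the SCA block sketched above).
    """
    return {stage: query_vlm(video, prompt)
            for stage, prompt in STAGE_PROMPTS.items()}
```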