As computer-generated content and deepfakes steadily improve, semantic approaches to multimedia forensics will become increasingly important. In this paper, we introduce a novel classification architecture for identifying semantic inconsistencies between video appearance and text caption in social media news posts. Our multi-modal fusion framework detects mismatches between videos and captions by leveraging an ensemble of textual analysis of the caption, automatic audio transcription, semantic video analysis, object detection, named-entity consistency, and facial verification. To train and test our approach, we curate a new video-based dataset of 4,000 real-world Facebook news posts. Our multi-modal approach achieves 60.5% classification accuracy on random mismatches between caption and appearance, compared to below 50% accuracy for uni-modal models. Further ablation studies confirm the necessity of fusion across modalities for correctly identifying semantic inconsistencies.
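The ensemble described above can be pictured as a late-fusion classifier over per-modality consistency scores. The sketch below is illustrative only, not the paper's implementation: the modality list mirrors the cues named in the abstract, but the stubbed scorer, the toy training data, and the logistic-regression fuser are all assumptions made for demonstration.

```python
# A minimal late-fusion sketch (assumption: not the authors' actual pipeline).
# Each modality is assumed to yield a caption/video consistency score in
# [0, 1]; here those scores are stubbed so the example runs end to end.

import numpy as np
from sklearn.linear_model import LogisticRegression

MODALITIES = [
    "caption_text",        # textual analysis of the caption
    "audio_transcript",    # automatic audio transcription vs. caption
    "video_semantics",     # semantic video analysis
    "object_detection",    # objects named in caption vs. objects detected
    "named_entities",      # named-entity consistency
    "face_verification",   # faces in video vs. people named in caption
]

def score_modalities(post):
    """Stub: in a real system each entry would come from a dedicated
    model (e.g., transcript/caption similarity). Here we simply read
    precomputed scores from the post dict."""
    return np.array([post[m] for m in MODALITIES])

# Toy training data: rows of per-modality scores; label 1 = mismatched caption.
rng = np.random.default_rng(0)
X_train = rng.random((200, len(MODALITIES)))
y_train = (X_train.mean(axis=1) < 0.5).astype(int)  # placeholder labels

# Fuse the per-modality scores with a simple learned linear combiner.
fusion = LogisticRegression().fit(X_train, y_train)

post = dict(zip(MODALITIES, rng.random(len(MODALITIES))))
prob_mismatch = fusion.predict_proba(score_modalities(post).reshape(1, -1))[0, 1]
print(f"P(caption mismatch) = {prob_mismatch:.2f}")
```

A late-fusion design like this keeps each modality's model independent, which matches the abstract's finding: any single score is weakly informative (below 50% accuracy uni-modally), while the fused decision is what lifts accuracy to 60.5%.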