Social media generates massive amounts of multimedia content with paired images and text every day, creating a pressing need to automate vision-and-language understanding for various multimodal classification tasks. Compared with the commonly studied visual-lingual data, social media posts tend to exhibit more implicit image-text relations. To better bridge the cross-modal semantics therein, we capture hinting features from user comments, which are retrieved by jointly leveraging visual and lingual similarity. The classification tasks are then explored via self-training in a teacher-student framework, motivated by the limited scale of labeled data in existing benchmarks. Extensive experiments are conducted on four multimodal social media benchmarks for image-text relation classification, sarcasm detection, sentiment classification, and hate speech detection. The results show that our method further advances the performance of previous state-of-the-art models, which do not employ comment modeling or self-training.
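To make the comment-retrieval step above concrete, the following is a minimal illustrative sketch, not the authors' actual pipeline: it ranks candidate comments by a weighted combination of visual and lingual cosine similarities over precomputed embeddings. The function names, the precomputed embeddings, and the weighting parameter alpha are all assumptions introduced here for illustration.

```python
# Illustrative sketch (assumed, not the paper's implementation): retrieve user
# comments for a post by combining visual and lingual similarity over
# precomputed embeddings.
import numpy as np

def retrieve_comments(post_image_emb, post_text_emb,
                      comment_image_embs, comment_text_embs,
                      top_k=5, alpha=0.5):
    """Rank candidate comments by a weighted sum of cosine similarities
    between the post's image/text embeddings and each comment's
    image/text embeddings, and return the indices of the top-k comments."""
    def cosine(query, candidates):
        # Normalize and compute cosine similarity of one query vector
        # against a matrix of candidate vectors.
        query = query / np.linalg.norm(query)
        candidates = candidates / np.linalg.norm(candidates, axis=-1, keepdims=True)
        return candidates @ query  # shape: (num_comments,)

    visual_sim = cosine(post_image_emb, comment_image_embs)
    lingual_sim = cosine(post_text_emb, comment_text_embs)
    joint_sim = alpha * visual_sim + (1.0 - alpha) * lingual_sim
    return np.argsort(-joint_sim)[:top_k]
```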