Prior work on ideology prediction has largely focused on single modalities, i.e., text or images. In this work, we introduce the task of multimodal ideology prediction, where a model predicts binary or five-point scale ideological leanings, given a text-image pair with political content. We first collect five new large-scale datasets with English documents and images along with their ideological leanings, covering news articles from a wide range of US mainstream media and social media posts from Reddit and Twitter. We conduct in-depth analyses of news articles and reveal differences in image content and usage across the political spectrum. Furthermore, we perform extensive experiments and ablation studies, demonstrating the effectiveness of targeted pretraining objectives on different model components. Our best-performing model, a late-fusion architecture pretrained with a triplet objective over multimodal content, outperforms the state-of-the-art text-only model by almost 4% and a strong multimodal baseline with no pretraining by over 3%.
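As a rough illustration of the two ideas named for the best-performing model, the sketch below shows late fusion followed by a triplet objective. It is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not specify the encoders, the fusion rule, the distance function, or the margin. Concatenation, Euclidean distance, `margin=1.0`, and the function names `late_fusion` and `triplet_loss` are all hypothetical choices made here for illustration.

```python
import numpy as np


def late_fusion(text_emb, image_emb):
    """Fuse separately encoded text and image vectors.

    Assumption: fusion is plain concatenation; the abstract does not
    state which fusion rule the model uses.
    """
    return np.concatenate(
        [np.asarray(text_emb, dtype=float), np.asarray(image_emb, dtype=float)],
        axis=-1,
    )


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss: mean(max(0, d(a, p) - d(a, n) + margin)).

    Assumption: Euclidean distance and margin=1.0 are illustrative defaults.
    """
    d_pos = np.linalg.norm(anchor - positive, axis=-1)
    d_neg = np.linalg.norm(anchor - negative, axis=-1)
    return float(np.mean(np.maximum(0.0, d_pos - d_neg + margin)))


# Toy embeddings standing in for encoder outputs (hypothetical values).
anchor = late_fusion([0.0, 0.0], [0.0, 0.0])
same_leaning = late_fusion([1.0, 0.0], [0.0, 0.0])   # distance 1 from anchor
other_leaning = late_fusion([0.0, 0.0], [3.0, 4.0])  # distance 5 from anchor

loss_separated = triplet_loss(anchor, same_leaning, other_leaning)
loss_violated = triplet_loss(anchor, other_leaning, same_leaning)
```

In this toy setup the loss is zero when the same-leaning pair is already closer to the anchor than the other-leaning pair by at least the margin, and positive otherwise, which is the behavior a triplet objective uses to pull together content with the same ideological leaning.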