Ideology is at the core of political science research. Yet, there still do not exist general-purpose tools to characterize and predict ideology across different genres of text. To this end, we study Pretrained Language Models using novel ideology-driven pretraining objectives that rely on the comparison of articles on the same story written by media of different ideologies. We further collect a large-scale dataset, consisting of more than 3.6M political news articles, for pretraining. Our model POLITICS outperforms strong baselines and previous state-of-the-art models on ideology prediction and stance detection tasks. Further analyses show that POLITICS is especially good at understanding long or formally written texts, and is also robust in few-shot learning scenarios.
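To make the comparison-based pretraining idea concrete, below is a minimal illustrative sketch, not the authors' exact objective: a triplet loss over articles covering the same story, where articles from same-leaning outlets are pulled together and articles from a different-leaning outlet are pushed apart. The names `TinyArticleEncoder` and `story_triplet_loss` are hypothetical; in practice the encoder would be a pretrained language model rather than the stand-in used here.

```python
# Illustrative sketch only (assumed formulation, not the paper's exact objective):
# a story-level triplet loss over article embeddings.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyArticleEncoder(nn.Module):
    """Stand-in encoder: embeds token ids and mean-pools into one article vector.
    A real setup would use a pretrained LM encoder instead."""
    def __init__(self, vocab_size: int = 30522, dim: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len) -> (batch, dim)
        return self.embed(token_ids).mean(dim=1)

def story_triplet_loss(anchor: torch.Tensor,
                       same_ideology: torch.Tensor,
                       diff_ideology: torch.Tensor,
                       margin: float = 1.0) -> torch.Tensor:
    """Triplet loss over articles on the same story: `anchor` and `same_ideology`
    come from outlets with the same leaning, `diff_ideology` from a different one."""
    return F.triplet_margin_loss(anchor, same_ideology, diff_ideology, margin=margin)

if __name__ == "__main__":
    enc = TinyArticleEncoder()
    # Fake token ids for three articles covering the same news story.
    a, p, n = (torch.randint(0, 30522, (2, 64)) for _ in range(3))
    loss = story_triplet_loss(enc(a), enc(p), enc(n))
    loss.backward()
    print(float(loss))
```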