We propose a novel supervised learning approach for political ideology prediction (PIP) that generalizes to out-of-distribution inputs. This problem is motivated by the fact that manual data labeling is expensive, while self-reported labels are often scarce and exhibit significant selection bias. We propose a novel statistical model that decomposes document embeddings into a linear superposition of two vectors: a latent neutral \emph{context} vector independent of ideology, and a latent \emph{position} vector aligned with ideology. We train an end-to-end model that outputs intermediate context and position vectors. At deployment time, our model predicts labels for input documents by leveraging only the predicted position vectors. On two benchmark datasets we show that our model produces predictions even when trained with as little as 5\% biased data, and is significantly more accurate than the state of the art. Through crowd-sourcing we validate the neutrality of the context vectors, and show that filtering out context results in ideological concentration, enabling prediction on out-of-distribution examples.
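The core decomposition can be illustrated with a minimal numerical sketch. All names and dimensions below are hypothetical, not from the paper: we assume a context subspace with an orthonormal basis, project each document embedding onto it to obtain the neutral context vector, and treat the residual as the ideology-aligned position vector used for prediction.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # hypothetical embedding dimension

def decompose(embedding, context_basis):
    """Split an embedding e into e = c + p.

    c: projection onto a (hypothetical) ideology-neutral context subspace.
    p: the residual, serving as the ideology-aligned position vector.
    """
    c = context_basis @ (context_basis.T @ embedding)
    p = embedding - c
    return c, p

# Toy context basis: orthonormal columns spanning half of the space.
Q, _ = np.linalg.qr(rng.standard_normal((d, d // 2)))
e = rng.standard_normal(d)  # toy document embedding
c, p = decompose(e, Q)

# The linear superposition reconstructs the original embedding,
# and the position vector is orthogonal to the context subspace.
assert np.allclose(c + p, e)
assert np.allclose(Q.T @ p, 0.0)
```

At deployment, a classifier would consume only `p`, mirroring the paper's claim that labels are predicted exclusively from the position vectors; in the actual model the decomposition is learned end-to-end rather than fixed by a projection as above.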