Social media is increasingly used for large-scale population predictions, such as estimating community health statistics. However, social media users are not typically a representative sample of the intended population -- a "selection bias". Within the social sciences, such a bias is typically addressed with restratification techniques, where observations are reweighted according to how under- or over-sampled their socio-demographic groups are. Yet, restratification is rarely evaluated for improving prediction. In this two-part study, we first evaluate standard, "out-of-the-box" restratification techniques, finding that they provide no improvement and often even degrade prediction accuracy across four tasks of estimating U.S. county population health statistics from Twitter. The core reasons for the degraded performance seem to be tied to their reliance on either sparse or shrunken estimates of each population's socio-demographics. In the second part of our study, we develop and evaluate Robust Poststratification, which consists of three methods to address these problems: (1) estimator redistribution to account for shrinking, as well as (2) adaptive binning and (3) informed smoothing to handle sparse socio-demographic estimates. We show that each of these methods leads to significant improvement in prediction accuracy over the standard restratification approaches. Taken together, Robust Poststratification enables state-of-the-art prediction accuracies, yielding a 53.0% increase in variance explained (R^2) in the case of surveyed life satisfaction, and a 17.8% average increase across all tasks.
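To make the reweighting idea concrete, the following is a minimal sketch of standard poststratification, not of the Robust Poststratification methods developed in the study. It assumes each sampled user has already been assigned to a single socio-demographic bin and that the true population share of each bin is known. The function names, bin labels, and numbers are hypothetical and chosen only for illustration.

```python
from collections import Counter


def poststratification_weights(sample_groups, population_shares):
    """Weight each observation by (population share) / (sample share) of its bin.

    sample_groups: list of bin labels, one per sampled user (hypothetical input).
    population_shares: dict mapping bin label -> share of the target population.
    """
    counts = Counter(sample_groups)
    n = len(sample_groups)
    weights = []
    for group in sample_groups:
        sample_share = counts[group] / n
        # Over-sampled bins get weights below 1, under-sampled bins above 1.
        weights.append(population_shares[group] / sample_share)
    return weights


def weighted_mean(values, weights):
    """Weighted average of per-user values."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Illustrative data: "young" users are over-sampled (75% of the sample,
# assumed 50% of the population).
groups = ["young", "young", "young", "old"]
scores = [1.0, 1.0, 1.0, 5.0]
shares = {"young": 0.5, "old": 0.5}

weights = poststratification_weights(groups, shares)
raw_estimate = sum(scores) / len(scores)
reweighted_estimate = weighted_mean(scores, weights)
```

In this toy case the unweighted mean is pulled toward the over-sampled bin, while the reweighted mean treats both bins as equal halves of the population. The sketch also shows the sparsity problem: a bin with very few sampled users receives a very large weight, and a bin with no sampled users cannot be weighted at all.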