Social media is increasingly used for large-scale population predictions, such as estimating community health statistics. However, social media users are not typically a representative sample of the intended population, which introduces a "selection bias". Within the social sciences, such a bias is typically addressed with restratification techniques, where observations are reweighted according to how under- or over-sampled their socio-demographic groups are. Yet, restratification is rarely evaluated for improving prediction. Across four tasks of predicting U.S. county population health statistics from Twitter, we find standard restratification techniques provide no improvement and often degrade prediction accuracy. The core reasons for this seem to be both shrunken estimates (reduced variance of model-predicted values) and sparse estimates of each population's socio-demographics. We thus develop and evaluate three methods to address these problems: estimator redistribution to account for shrinking, and adaptive binning and informed smoothing to handle sparse socio-demographic estimates. We show that each of these methods significantly outperforms the standard restratification approaches. Combining approaches, we find substantial improvements over non-restratified models, yielding a 53.0% increase in predictive accuracy (R^2) in the case of surveyed life satisfaction, and a 17.8% average increase across all tasks.
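
As a minimal sketch of the standard reweighting idea described above (not the estimator redistribution, adaptive binning, or informed smoothing methods, which the abstract does not specify), each user can be weighted by the ratio of their socio-demographic group's population share to its share of the sample. The function names, group labels, shares, and scores below are hypothetical and chosen only for illustration.

```python
from collections import Counter


def poststratification_weights(sample_groups, population_shares):
    """Return one weight per user: population share / sample share of the user's group.

    sample_groups: list of group labels, one per sampled user (hypothetical labels).
    population_shares: dict mapping group label -> share of the target population.
    """
    n = len(sample_groups)
    counts = Counter(sample_groups)
    weights = []
    for group in sample_groups:
        sample_share = counts[group] / n
        # Over-sampled groups get weights below 1, under-sampled groups above 1.
        weights.append(population_shares[group] / sample_share)
    return weights


def weighted_mean(values, weights):
    """Weighted average of per-user values."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Hypothetical county: younger users are over-sampled relative to the population.
sample_groups = ["18-29", "18-29", "18-29", "30+"]
population_shares = {"18-29": 0.25, "30+": 0.75}
scores = [1.0, 1.0, 1.0, 5.0]  # made-up per-user model scores

weights = poststratification_weights(sample_groups, population_shares)
unweighted_estimate = sum(scores) / len(scores)
county_estimate = weighted_mean(scores, weights)
```

In this toy example the single user in the under-sampled group receives a large weight and dominates the county estimate, which hints at why sparse socio-demographic estimates can make naive reweighting unstable.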