Although the recent rise in uptake of COVID-19 vaccines in the United States has been encouraging, there continues to be significant vaccine hesitancy in various geographic and demographic clusters of the adult population. Surveys, such as the one conducted by Gallup over the past year, can be useful in determining vaccine hesitancy, but they can be expensive to conduct and do not provide real-time data. At the same time, the advent of social media suggests that it may be possible to obtain vaccine hesitancy signals at an aggregate level (such as at the level of zip codes) by using machine learning models and socioeconomic (and other) features from publicly available sources. It is currently an open question whether such an endeavor is feasible, and how it compares to baselines that use only constant priors. To our knowledge, a proper methodology and evaluation results using real data have also not been presented. In this article, we present such a methodology and experimental study, using publicly available Twitter data collected over the last year. Our goal is not to devise novel machine learning algorithms, but to evaluate existing and established models in a comparative framework. We show that the best models significantly outperform constant priors, and can be set up using open-source tools.
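To make the comparison against a constant prior concrete, the following is a minimal sketch, not the models or data used in this study. It assumes a single hypothetical socioeconomic feature per region and a synthetic hesitancy rate generated from it; the function names, the data-generating coefficients, and the train/test split are all illustrative assumptions. A constant prior here simply predicts the mean hesitancy of the training regions, and the alternative is an ordinary least-squares line fitted to the same regions.

```python
import math
import random
import statistics


def make_synthetic_regions(n, seed=0):
    """Generate hypothetical (feature, hesitancy) pairs. Synthetic, not real data."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.uniform(0.0, 1.0)                  # hypothetical socioeconomic feature
        y = 0.2 + 0.5 * x + rng.gauss(0.0, 0.05)   # hypothetical hesitancy rate
        rows.append((x, y))
    return rows


def fit_constant_prior(train):
    """Baseline: always predict the mean hesitancy seen in training."""
    mean_y = statistics.fmean([y for _, y in train])
    return lambda x: mean_y


def fit_linear(train):
    """Ordinary least squares with one feature (closed-form solution)."""
    xs = [x for x, _ in train]
    ys = [y for _, y in train]
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in train)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def rmse(model, rows):
    """Root mean squared error of a model over (feature, target) rows."""
    return math.sqrt(statistics.fmean([(model(x) - y) ** 2 for x, y in rows]))


data = make_synthetic_regions(200)
train, test = data[:150], data[150:]   # simple held-out split

baseline_rmse = rmse(fit_constant_prior(train), test)
linear_rmse = rmse(fit_linear(train), test)

print(f"constant prior RMSE: {baseline_rmse:.3f}")
print(f"linear model RMSE:   {linear_rmse:.3f}")
```

Because the synthetic target depends on the feature by construction, the fitted line should have a lower held-out error than the constant prior. On real data this gap is precisely the quantity in question: a model is only useful to the extent that it beats the constant prior on regions it has not seen.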