We present a novel approach to incorporating transformer-based language models into infectious disease modelling. Text-derived features are quantified by tracking high-density clusters of sentence-level representations of Reddit posts within the COVID-19 subreddits of specific US states. We benchmark these clustered embedding features against features extracted from other high-quality datasets. In a threshold-classification task, we show that they outperform all other feature types at predicting upward trend signals, a significant result for infectious disease modelling in areas where epidemiological data is unreliable. Subsequently, in a time-series forecasting task, we fully utilise the predictive power of the caseload and compare the relative strengths of different supplementary datasets when used as covariate feature sets in a transformer-based time-series model.
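As a rough illustration of the two ideas above (tracking cluster membership over time, and labelling upward trends by a threshold), the following minimal sketch may help. It is not the authors' implementation. It assumes that each post has already been assigned a cluster label by some density-based clustering of its sentence embedding, with `-1` marking posts outside any dense cluster. The function names, the relative-increase definition of an "upward trend", and the `horizon` and `threshold` parameters are all hypothetical choices made for illustration.

```python
from collections import Counter, defaultdict

NOISE = -1  # assumed label for posts that fall outside every dense cluster


def cluster_share_features(posts, cluster_ids):
    """Turn (day, cluster_label) pairs into one feature vector per day.

    Each entry is the share of that day's posts (noise included in the
    denominator) that fall in the corresponding cluster of `cluster_ids`.
    """
    per_day = defaultdict(Counter)
    for day, label in posts:
        per_day[day][label] += 1

    features = {}
    for day in sorted(per_day):
        counts = per_day[day]
        total = sum(counts.values())
        # Counter returns 0 for clusters with no posts on this day.
        features[day] = [counts[c] / total for c in cluster_ids]
    return features


def upward_trend_labels(cases, horizon, threshold):
    """Label step t as 1 if cases rise by more than `threshold`
    (as a fraction of the current value) over the next `horizon` steps."""
    labels = []
    for t in range(len(cases) - horizon):
        now, later = cases[t], cases[t + horizon]
        rising = now > 0 and (later - now) / now > threshold
        labels.append(1 if rising else 0)
    return labels


# Hypothetical toy data: two days of posts, two clusters plus noise.
posts = [
    ("2020-04-01", 0), ("2020-04-01", 0), ("2020-04-01", NOISE),
    ("2020-04-01", 1), ("2020-04-02", 1), ("2020-04-02", 1),
]
features = cluster_share_features(posts, cluster_ids=[0, 1])
labels = upward_trend_labels([100, 110, 150, 140], horizon=1, threshold=0.2)
```

In this sketch the per-day cluster shares would serve as the text-derived covariates, and the binary labels as the classification target; any real pipeline would need to align the two by date and by state.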