Objective: This study aims to develop an end-to-end natural language processing pipeline for triage and diagnosis of COVID-19 from patient-authored social media posts. Materials and Methods: The text processing pipeline first uses conditional random fields to extract COVID-19 symptoms and related concepts, such as severity, duration, negations, and body parts, from patients' posts. In the next step of the pipeline, an unsupervised rule-based algorithm establishes relations between the extracted concepts. The extracted concepts and relations are subsequently used to construct two different vector representations of each post. These vectors are used separately to build support vector machine models that triage patients into three categories and diagnose them for COVID-19. Results: We report macro- and micro-averaged F_1 scores in the range of 71-96% and 61-87%, respectively, for the triage and diagnosis of COVID-19 when the models are trained on ground-truth labelled data. Our experimental results indicate that similar performance can be achieved when the models are trained using predicted labels from the concept extraction and rule-based classifiers, thus yielding end-to-end machine learning. Discussion: We highlight important features uncovered by our diagnostic machine learning models and compare them with the most frequent symptoms revealed in another COVID-19 dataset. In particular, we found that the most important features are not always the most frequent ones. Conclusions: Our preliminary results show that it is possible to automatically triage and diagnose patients for COVID-19 from natural language narratives using a machine learning pipeline.
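The macro- and micro-averaged F_1 scores mentioned in the Results aggregate per-class performance in different ways: macro-averaging takes the unweighted mean of the per-class F_1 scores, whereas micro-averaging pools true positives, false positives, and false negatives over all classes before computing a single F_1. The following is a minimal Python sketch of that distinction for a three-class problem. The class names `A`, `B`, `C` and the toy labels are hypothetical placeholders, and this is not the authors' evaluation code.

```python
from collections import Counter


def f1_from_counts(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_micro_f1(y_true, y_pred):
    """Return (macro_f1, micro_f1) for single-label multi-class predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    # Macro: unweighted mean of per-class F1 scores.
    macro = sum(f1_from_counts(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    # Micro: pool the counts over all classes, then compute one F1.
    micro = f1_from_counts(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


# Hypothetical toy labels for a three-category task (not data from the study).
y_true = ["A", "A", "A", "A", "B", "B", "C", "C"]
y_pred = ["A", "A", "A", "B", "B", "B", "C", "A"]
macro, micro = macro_micro_f1(y_true, y_pred)
print(f"macro-F1 = {macro:.3f}, micro-F1 = {micro:.3f}")
```

In the single-label multi-class setting, micro-averaged F_1 coincides with accuracy, while macro-averaged F_1 gives small classes the same weight as large ones, which is why the two can differ noticeably on imbalanced data.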