Pain is a common reason for accessing healthcare resources and is a growing area of research, especially in its overlap with mental health. Mental health electronic health records are a good data source for studying this overlap. However, much of the information on pain is held in the free text of these records, where mentions of pain pose a distinctive natural language processing problem because they are often ambiguous. This project uses data from an anonymised mental health electronic health records database to train a machine-learning-based classification algorithm that classifies sentences as discussing patient pain or not. This will facilitate the extraction of relevant pain information from large databases and the use of such outputs in further studies on pain and mental health. A total of 1,985 documents were manually triple-annotated to create gold-standard training data, which were used to train three commonly used classification algorithms. The best-performing model achieved an F1 score of 0.98 (95% CI 0.98-0.99).
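For readers unfamiliar with the reported metric, the following is a minimal sketch of how an F1 score is computed for a binary sentence-level task of this kind. The function name `f1_score_binary`, the label encoding (1 for a sentence that discusses patient pain, 0 otherwise), and the toy labels are illustrative assumptions, not taken from the study; the sketch does not reproduce the authors' models, data, or confidence interval procedure.

```python
from typing import Sequence


def f1_score_binary(gold: Sequence[int], predicted: Sequence[int]) -> float:
    """Return the F1 score for the positive class.

    Assumed encoding (hypothetical): 1 = sentence discusses patient pain,
    0 = it does not.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    # Count true positives, false positives and false negatives.
    tp = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, predicted) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 0)

    # With no true positives, precision and recall are zero or undefined;
    # report 0.0 by convention.
    if tp == 0:
        return 0.0

    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Toy example with made-up labels for eight sentences.
gold = [1, 1, 1, 0, 0, 1, 0, 1]
predicted = [1, 1, 0, 0, 1, 1, 0, 1]
score = f1_score_binary(gold, predicted)
```

In this toy example there are four true positives, one false positive and one false negative, so precision and recall are both 0.8 and the F1 score is 0.8.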