Pain is a common reason for accessing healthcare resources and is a growing area of research, especially in its overlap with mental health. Mental health electronic health records are a good data source for studying this overlap. However, much of the information on pain is held in the free text of these records, where mentions of pain present a unique natural language processing problem because the term is ambiguous. This project uses data from an anonymised mental health electronic health records database to train a machine-learning-based classification algorithm that classifies sentences as discussing patient pain or not. This will facilitate the extraction of relevant pain information from large databases, and the use of such outputs in further studies on pain and mental health. A total of 1,985 documents were manually triple-annotated to create gold standard training data, which were used to train three commonly used classification algorithms. The best performing model achieved an F1-score of 0.98 (95% CI 0.98-0.99).
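To make the reported metric concrete, the following minimal sketch shows one way an F1-score and a 95% confidence interval could be computed for a binary sentence classifier (1 = sentence discusses patient pain, 0 = it does not). The abstract does not state how the interval was obtained, so the percentile bootstrap used here is an assumption, and the function names `f1_score` and `bootstrap_f1_ci` and the toy labels are hypothetical, chosen only for illustration.

```python
import random


def f1_score(y_true, y_pred):
    """F1 for the positive class (1 = sentence discusses patient pain)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        # No true positives: precision and recall are zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def bootstrap_f1_ci(y_true, y_pred, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for F1 (assumed method, not the paper's)."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_resamples):
        # Resample sentence indices with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(f1_score([y_true[i] for i in idx],
                               [y_pred[i] for i in idx]))
    scores.sort()
    lower = scores[int((alpha / 2) * n_resamples)]
    upper = scores[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper


# Toy labels, invented for illustration only (not data from the study).
y_true = [1, 1, 1, 0, 0, 1, 0, 1, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 1, 0, 0]

point = f1_score(y_true, y_pred)
low, high = bootstrap_f1_ci(y_true, y_pred)
print(f"F1 = {point:.2f} (95% CI {low:.2f}-{high:.2f})")
```

With only ten toy sentences the interval is wide; a narrow interval such as the one reported requires a much larger evaluation set.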