In this work, we present the first corpus for German Adverse Drug Reaction (ADR) detection in patient-generated content. The data consists of 4,169 documents with binary annotations from a German patient forum, where users discuss health issues and receive advice from medical doctors. As is common for social media data in this domain, the class labels of the corpus are highly imbalanced. This, together with a high topic imbalance, makes the dataset very challenging, since the same symptom can often have several causes and is not always related to medication intake. We aim to encourage further multilingual efforts in the domain of ADR detection and provide preliminary experiments for binary classification using different methods of zero- and few-shot learning based on a multilingual model. When fine-tuning XLM-RoBERTa first on English patient forum data and then on the new German data, we achieve an F1-score of 37.52 for the positive class. We make the dataset and models publicly available to the community.
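To illustrate why the F1-score of the positive class is a more informative metric than accuracy on imbalanced labels, the following is a minimal sketch in plain Python. The function name `positive_class_f1` and the toy labels are assumptions made for illustration only; they are not taken from the corpus or from the authors' evaluation code. The function returns a value in [0, 1], whereas the score reported above appears to be on a 0 to 100 scale.

```python
def positive_class_f1(y_true, y_pred, positive=1):
    """F1-score of the positive class for binary labels (hypothetical helper)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data (illustrative only): 10 documents, 2 of which mention an ADR.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]          # one hit, one miss, one false alarm
y_all_negative = [0] * len(y_true)               # trivial "never ADR" baseline

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
baseline_accuracy = sum(t == p for t, p in zip(y_true, y_all_negative)) / len(y_true)

f1 = positive_class_f1(y_true, y_pred)
baseline_f1 = positive_class_f1(y_true, y_all_negative)

print(f"classifier: accuracy={accuracy:.2f}, positive-class F1={f1:.2f}")
print(f"baseline:   accuracy={baseline_accuracy:.2f}, positive-class F1={baseline_f1:.2f}")
```

On this toy data, the trivial baseline that never predicts an ADR matches the classifier's accuracy but has a positive-class F1 of zero, which is why a positive-class metric is the more meaningful choice when positives are rare.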