Multi-label learning predicts a subset of labels from a given label set for an unseen instance while considering label correlations. A known challenge in multi-label classification is the long-tailed distribution of labels. Many studies focus on improving the overall predictions of the model and thus do not prioritise tail-end labels. Improving tail-end label predictions in multi-label classification of medical text makes it possible to understand patients better and improve care. The knowledge gained from one or more infrequent labels can influence the course of medical decisions and treatment plans. This research presents variations of concatenated domain-specific language models, including multi-BioMed-Transformers, to achieve two primary goals: first, to improve the F1 scores of infrequent labels across multi-label problems, especially those with long-tail labels; second, to handle long medical text and multi-sourced electronic health records (EHRs), a challenging task for standard transformers designed to work on short input sequences. A vital contribution of this research is new state-of-the-art (SOTA) results obtained using TransformerXL for predicting medical codes. A variety of experiments are performed on the Medical Information Mart for Intensive Care (MIMIC-III) database. Results show that concatenated BioMed-Transformers outperform standard transformers in terms of overall micro and macro F1 scores and individual F1 scores of tail-end labels, while incurring lower training times than existing transformer-based solutions for long input sequences.
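The concatenation strategy for long documents can be sketched as follows. This is a minimal illustration, not the authors' implementation: the long token sequence is split into windows that fit a standard transformer's input limit, each window is encoded separately, and the per-window representations are concatenated before a multi-label classification head. The stub `encode_chunk` stands in for a domain-specific encoder such as a BioMed transformer, whose per-window embedding would be produced by a real pretrained model.

```python
import numpy as np

MAX_LEN = 512     # typical transformer input window
HIDDEN_DIM = 768  # typical encoder hidden size

def chunk_tokens(token_ids, max_len=MAX_LEN):
    """Split a long token sequence into consecutive fixed-size windows."""
    return [token_ids[i:i + max_len] for i in range(0, len(token_ids), max_len)]

def encode_chunk(chunk, hidden_dim=HIDDEN_DIM):
    """Stand-in for a pretrained encoder's pooled embedding of one window.

    A real pipeline would run the window through e.g. a biomedical
    transformer and take its pooled/[CLS] vector; here we just return a
    deterministic random vector of the right shape.
    """
    rng = np.random.default_rng(len(chunk))
    return rng.standard_normal(hidden_dim)

def concatenated_representation(token_ids):
    """Encode each window and concatenate into one long document vector."""
    chunks = chunk_tokens(token_ids)
    return np.concatenate([encode_chunk(c) for c in chunks])

# A document longer than one transformer window (1300 tokens -> 3 windows).
doc = list(range(1300))
rep = concatenated_representation(doc)
# `rep` would feed a multi-label head, e.g. one sigmoid output per medical code.
```

The design point is that no window exceeds the encoder's input limit, so long EHR text is covered without the quadratic attention cost of a single very long sequence.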