Bangla MedER: Multi-BERT Ensemble Approach for the Recognition of Bangla Medical Entity

Medical Entity Recognition (MedER) is an essential NLP task for extracting meaningful entities from medical corpora. MedER research can contribute substantially to automated systems in the medical sector and, ultimately, to better patient care and outcomes. Although MedER has been studied extensively for English, low-resource languages such as Bangla remain underexplored, and our work aims to bridge this gap. For Bangla medical entity recognition, this study first examines several transformer models: BERT, DistilBERT, ELECTRA, and RoBERTa. We then propose a novel Multi-BERT Ensemble approach that outperforms all baseline models, reaching the highest accuracy of 89.58%. Notably, it provides an 11.80% accuracy improvement over the single-layer BERT model, demonstrating its effectiveness for this task. A major challenge in MedER for low-resource languages is the lack of annotated datasets. To address this, we developed a high-quality dataset tailored to the Bangla MedER task and used it to evaluate our model with multiple performance metrics, demonstrating the model's robustness and applicability. Our findings highlight the potential of Multi-BERT Ensemble models for improving Bangla MedER and lay the foundation for further advances in low-resource medical NLP.
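The abstract does not say how the individual BERT models are combined, so the following is only a minimal sketch of one common ensembling scheme, soft voting, in which per-token class probabilities from several models are averaged before the label is chosen. The label set `LABELS`, the function names `soft_vote` and `token_accuracy`, and the probability values are all hypothetical and are not taken from the paper.

```python
from typing import List, Sequence

# Hypothetical label set for illustration only; the abstract does not list its entity tags.
LABELS = ["O", "B-DISEASE", "B-DRUG"]


def soft_vote(model_probs: Sequence[Sequence[Sequence[float]]]) -> List[str]:
    """Average per-token class probabilities across models, then take the argmax.

    model_probs[m][t][c] is model m's probability of class c for token t.
    This is an assumed combination rule, not necessarily the paper's.
    """
    if not model_probs:
        raise ValueError("need at least one model")
    n_models = len(model_probs)
    n_tokens = len(model_probs[0])
    predictions = []
    for t in range(n_tokens):
        # Mean probability of each class over all models for token t.
        avg = [
            sum(model_probs[m][t][c] for m in range(n_models)) / n_models
            for c in range(len(LABELS))
        ]
        best = max(range(len(LABELS)), key=lambda c: avg[c])
        predictions.append(LABELS[best])
    return predictions


def token_accuracy(pred: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of tokens whose predicted label matches the gold label."""
    if len(pred) != len(gold):
        raise ValueError("pred and gold must have the same length")
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)


# Made-up probabilities from three hypothetical models for a two-token sentence.
probs_a = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]
probs_b = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
probs_c = [[0.2, 0.7, 0.1], [0.2, 0.2, 0.6]]

ensemble_labels = soft_vote([probs_a, probs_b, probs_c])
print(ensemble_labels)
```

In this toy input, two of the three models prefer `O` for the first token, yet the averaged probabilities favor `B-DISEASE` because the third model is much more confident. That sensitivity to confidence is the main difference between soft voting and simple majority voting.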

