Medical Entity Recognition (MedER) is an essential NLP task for extracting meaningful entities from medical text. MedER research can contribute substantially to automated systems in the medical sector, ultimately improving patient care and outcomes. While MedER has been studied extensively for English, low-resource languages such as Bangla remain underexplored; our work aims to bridge this gap. For Bangla medical entity recognition, this study first evaluates several transformer models, including BERT, DistilBERT, ELECTRA, and RoBERTa. We then propose a novel Multi-BERT Ensemble approach that outperforms all baseline models with the highest accuracy of 89.58%, an 11.80% improvement over the single-layer BERT model, demonstrating its effectiveness for this task. A major challenge in MedER for low-resource languages is the scarcity of annotated datasets; to address this, we developed a high-quality dataset tailored to the Bangla MedER task and used it to evaluate our model against multiple performance metrics, demonstrating its robustness and applicability. Our findings highlight the potential of Multi-BERT ensembles for improving Bangla MedER and lay the foundation for further advances in low-resource medical NLP.
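The abstract does not specify how the Multi-BERT Ensemble combines its members, so the following is only a minimal sketch of one plausible scheme: averaging class probabilities across several fine-tuned BERT checkpoints for token classification. The checkpoint names, the Bangla example sentence, and the shared-tokenizer assumption (e.g., several BERT runs with different random seeds on the same vocabulary) are all hypothetical, not taken from the paper.

```python
# Hedged sketch: probability-averaging ensemble of fine-tuned BERT
# token-classification models (one plausible Multi-BERT Ensemble variant).
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Hypothetical checkpoint names; assumes all members share one tokenizer/vocab
# so their per-token outputs line up position by position.
CHECKPOINTS = ["meder-bert-seed1", "meder-bert-seed2", "meder-bert-seed3"]

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINTS[0])
models = [AutoModelForTokenClassification.from_pretrained(c).eval()
          for c in CHECKPOINTS]

def ensemble_predict(sentence: str) -> list[tuple[str, str]]:
    """Average softmax probabilities over ensemble members, then argmax."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        # Stack per-model probabilities: (num_models, seq_len, num_labels),
        # then mean over the model axis.
        probs = torch.stack(
            [m(**inputs).logits.softmax(dim=-1).squeeze(0) for m in models]
        ).mean(dim=0)
    label_ids = probs.argmax(dim=-1).tolist()
    id2label = models[0].config.id2label
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return list(zip(tokens, (id2label[i] for i in label_ids)))

# Illustrative usage on a Bangla sentence ("The patient has fever and headache.").
print(ensemble_predict("রোগীর জ্বর এবং মাথাব্যথা আছে।"))
```

Probability averaging is only one design choice; majority voting over per-model label predictions would be an equally reasonable reading of "ensemble" given the information in the abstract.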