The Arabic language suffers from a severe shortage of datasets suitable for training deep learning models, and the existing ones cover only general, non-specialized categories. In this work, we introduce a new Arabic medical dataset of two thousand medical documents collected from several Arabic medical websites and from the Arab Medical Encyclopedia. The dataset was built for text classification and covers 10 classes of diseases (Blood, Bone, Cardiovascular, Ear, Endocrine, Eye, Gastrointestinal, Immune, Liver, and Nephrological). Experiments on the dataset were performed by fine-tuning three pre-trained models: BERT from Google; AraBERT, which is based on BERT and trained on a large Arabic corpus; and AraBioNER, which is based on AraBERT and trained on an Arabic medical corpus.
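As an illustration of the classification setup, the following minimal sketch maps the ten class names to integer ids, as a classification head would require, and produces a per-class (stratified) train/test split. The helper `stratified_split`, the 80/20 ratio, and the fixed seed are assumptions made for illustration only; they are not taken from the experiments reported here.

```python
import random
from collections import defaultdict

# The ten class names, in the order they are listed in the abstract.
CLASSES = [
    "Blood", "Bone", "Cardiovascular", "Ear", "Endocrine",
    "Eye", "Gastrointestinal", "Immune", "Liver", "Nephrological",
]

# Integer ids for the classification head, and the reverse mapping.
LABEL2ID = {name: i for i, name in enumerate(CLASSES)}
ID2LABEL = {i: name for name, i in LABEL2ID.items()}


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Hypothetical helper: split document indices per class.

    Returns (train_indices, test_indices) so that each class contributes
    roughly `test_fraction` of its documents to the test set.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, test = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)
```

Given a list of label ids, one per document, `stratified_split(label_ids)` returns two disjoint index lists that together cover every document. The resulting ids can then be passed to whichever fine-tuning framework is used.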