Named entity recognition (NER) is a natural language processing (NLP) task that aims to identify named entities and classify them into categories such as person, location, and organization. Arabic has a considerable amount of unstructured data, and it requires different preprocessing tools than languages such as English, Russian, or German. This motivates building a new structured dataset to address the lack of structured Arabic data. In this work, we tag each word using the BIOES format, which allows us to handle nested named entities that consist of more than one word and to mark the start and the end of each name. The dataset consists of more than thirty-six thousand records. In addition, this work proposes long short-term memory (LSTM) units and gated recurrent units (GRU) for building the Arabic named entity recognition model. The models achieve a reasonably good result (approximately 80%), because LSTM and GRU models can capture the relationships between the words of a sentence. The models are built with Trax, a new library from Google, on the Colab platform.
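To make the BIOES scheme concrete, the following is a minimal sketch of how entity spans could be turned into per-word tags (B = begin, I = inside, O = outside, E = end, S = single-word entity). The function name `to_bioes`, the span representation, the label names, and the example sentence are illustrative assumptions, not taken from the dataset or code described in this work.

```python
from typing import List, Tuple

# Hypothetical span format: (start index, end index exclusive, entity label).
Span = Tuple[int, int, str]


def to_bioes(n_tokens: int, spans: List[Span]) -> List[str]:
    """Convert non-overlapping entity spans into one BIOES tag per token."""
    tags = ["O"] * n_tokens  # every token starts outside any entity
    for start, end, label in spans:
        if end - start == 1:
            # A one-word entity gets the single tag.
            tags[start] = f"S-{label}"
        else:
            # A multi-word entity gets explicit begin and end markers.
            tags[start] = f"B-{label}"
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{label}"
            tags[end - 1] = f"E-{label}"
    return tags


# Illustrative sentence: "Mohamed Salah visited Cairo".
tokens = ["زار", "محمد", "صلاح", "القاهرة"]
spans = [(1, 3, "PER"), (3, 4, "LOC")]
for token, tag in zip(tokens, to_bioes(len(tokens), spans)):
    print(token, tag)
```

Because the first and last word of every entity carry their own tags, a sequence model such as an LSTM or GRU tagger receives an explicit signal for where each name starts and ends.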