The COVID-19 pandemic has led to the creation of many corpora that facilitate NLP research and downstream applications to help fight the pandemic. However, most of these corpora are exclusively for English. As the pandemic is a global problem, it is worth creating COVID-19 related datasets for languages other than English. In this paper, we present the first manually annotated COVID-19 domain-specific dataset for Vietnamese. In particular, our dataset is annotated for the named entity recognition (NER) task with newly defined entity types that can be reused in future epidemics. Our dataset also contains the largest number of entities among existing Vietnamese NER datasets. We conduct experiments with strong baselines on our dataset and find that (i) automatic Vietnamese word segmentation helps improve NER results, and (ii) the highest performance is obtained by fine-tuning pre-trained language models, where the monolingual Vietnamese model PhoBERT (Nguyen and Nguyen, 2020) produces higher results than the multilingual model XLM-R (Conneau et al., 2020). We publicly release our dataset at: https://github.com/VinAIResearch/PhoNER_COVID19
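To make the difference between syllable-level and word-level NER input concrete, the following minimal Python sketch decodes tagged tokens into entity spans. It rests on assumptions not stated in the abstract: that labels follow the common BIO scheme, and that word-segmented tokens join their syllables with an underscore. The sentence, the tag names `PATIENT_ID` and `OCCUPATION`, and the helper `bio_to_spans` are illustrative only and are not taken from the released data.

```python
from typing import List, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Collect (entity_text, entity_type) pairs from BIO-tagged tokens."""
    spans: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type = None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new entity; flush any open one first.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # Continuation of the currently open entity.
            current_tokens.append(token)
        else:
            # "O" (or an inconsistent I- tag) closes any open entity.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        spans.append((" ".join(current_tokens), current_type))
    return spans


# Hypothetical sentence: "Patient number 91 is a pilot".
# Syllable level: each whitespace-separated syllable is one token.
syllable_tokens = ["Bệnh", "nhân", "số", "91", "là", "phi", "công"]
syllable_tags = ["O", "O", "O", "B-PATIENT_ID", "O", "B-OCCUPATION", "I-OCCUPATION"]

# Word level: a segmenter has merged multi-syllable words with "_".
word_tokens = ["Bệnh_nhân", "số", "91", "là", "phi_công"]
word_tags = ["O", "O", "B-PATIENT_ID", "O", "B-OCCUPATION"]

syllable_spans = bio_to_spans(syllable_tokens, syllable_tags)
word_spans = bio_to_spans(word_tokens, word_tags)
```

In this toy example, word segmentation shortens the sequence and lets a multi-syllable entity be labeled with a single tag. That is one plausible reason segmentation could help a tagger, though the abstract reports only the empirical effect.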