Building real-world complex Named Entity Recognition (NER) systems is a challenging task, owing to the complexity and ambiguity of named entities that appear in varied contexts such as short input sentences, emerging entities, and complex entities. Moreover, real-world queries are often malformed: they can be code-mixed or multilingual, among other scenarios. In this paper, we introduce our system submitted to the Multilingual Complex Named Entity Recognition (MultiCoNER) shared task. We approach complex NER for multilingual and code-mixed queries by relying on the contextualized representations provided by the multilingual Transformer XLM-RoBERTa. In addition to a CRF-based token classification layer, we incorporate a span classification loss to recognize named entity spans. Furthermore, we use a self-training mechanism to generate weakly annotated data from a large unlabeled dataset. Our proposed system ranks 6th and 8th on the multilingual and code-mixed MultiCoNER tracks, respectively.
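The self-training step mentioned above can be sketched as a confidence-based filter: run the trained model over unlabeled sentences and keep only the predictions it is confident about as weak labels. The sketch below is a minimal illustration under assumed details; `predict_with_confidence` is a hypothetical stand-in for the fine-tuned XLM-RoBERTa + CRF model, and the all-tokens-above-threshold rule is one simple selection criterion, not necessarily the exact one used in the system.

```python
import random

def predict_with_confidence(sentence):
    """Hypothetical stand-in for the real NER model: returns one
    (label, confidence) pair per token. In the actual system these
    scores would come from the fine-tuned XLM-RoBERTa + CRF model."""
    random.seed(hash(sentence) % 2**32)  # deterministic toy scores
    return [(random.choice(["O", "B-PER", "B-LOC"]), random.random())
            for _ in sentence.split()]

def self_train_select(unlabeled, threshold=0.9):
    """Keep only sentences where every token prediction clears the
    confidence threshold; these become weakly annotated training data
    that is mixed back into the labeled set."""
    weak = []
    for sent in unlabeled:
        preds = predict_with_confidence(sent)
        if all(conf >= threshold for _, conf in preds):
            weak.append((sent, [label for label, _ in preds]))
    return weak
```

A high threshold trades coverage for label quality: fewer sentences survive the filter, but the surviving weak labels are more reliable, which matters when the weakly annotated data is combined with the gold-labeled MultiCoNER training set.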