Many areas, such as the biological and healthcare domains, artistic works, and organization names, contain nested, overlapping, or discontinuous entity mentions that may also be syntactically or semantically ambiguous in practice. Traditional sequence tagging algorithms are unable to recognize these complex mentions because such mentions can violate the assumptions on which sequence tagging schemes are founded. In this paper, we describe our contribution to SemEval 2022 Task 11 on identifying such complex Named Entities. We combine an ensemble of multiple ELECTRA-based models pretrained exclusively on Bangla with ELECTRA-based models pretrained on English to achieve competitive performance on Track-11. Besides providing a system description, we also present the outcomes of our experiments on architectural decisions, dataset augmentations, and post-competition findings.
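To make the limitation of flat sequence tagging concrete, the following minimal sketch encodes entity spans in the common BIO scheme, which assigns exactly one tag per token. The helper `spans_to_bio`, the example phrase, and the `ORG`/`LOC` labels are hypothetical illustrations and are not taken from the system or the task's tag set. A nested mention needs two tags on the same token, so the encoding fails.

```python
from typing import List, Tuple

# (start, end_exclusive, label); labels here are illustrative only.
Span = Tuple[int, int, str]


def spans_to_bio(n_tokens: int, spans: List[Span]) -> List[str]:
    """Encode spans as one BIO tag per token.

    Raises ValueError when two spans share a token, because a flat
    BIO sequence has room for only one tag at each position.
    """
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        if any(tag != "O" for tag in tags[start:end]):
            raise ValueError(f"span {(start, end, label)} overlaps an earlier span")
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


tokens = ["University", "of", "Dhaka"]

# A single flat mention is representable.
flat_tags = spans_to_bio(len(tokens), [(0, 3, "ORG")])
# flat_tags == ["B-ORG", "I-ORG", "I-ORG"]

# A nested mention ("Dhaka" as LOC inside the ORG span) is not.
try:
    spans_to_bio(len(tokens), [(0, 3, "ORG"), (2, 3, "LOC")])
    nested_ok = True
except ValueError:
    nested_ok = False
```

Here `nested_ok` ends up `False`: the token "Dhaka" would need both `I-ORG` and `B-LOC`, which a one-tag-per-token scheme cannot express.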