We present an efficient bi-encoder framework for named entity recognition (NER), which applies contrastive learning to map candidate text spans and entity types into the same vector representation space. Prior work predominantly approaches NER as sequence labeling or span classification. We instead frame NER as a metric learning problem that maximizes the similarity between the vector representations of an entity mention and its type. This makes it easy to handle nested and flat NER alike, and can better leverage noisy self-supervision signals. A major challenge to this bi-encoder formulation for NER lies in separating non-entity spans from entity mentions. Instead of explicitly labeling all non-entity spans as the same class Outside (O) as in most prior methods, we introduce a novel dynamic thresholding loss, which is learned in conjunction with the standard contrastive loss. Experiments show that our method performs well in both supervised and distantly supervised settings, for nested and flat NER alike, establishing new state of the art across standard datasets in the general domain (e.g., ACE2004, ACE2005) and high-value verticals such as biomedicine (e.g., GENIA, NCBI, BC5CDR, JNLPBA).
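The decision rule described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: `span_emb`, `type_emb`, and the learned `threshold_emb` anchor are hypothetical toy tensors standing in for the bi-encoder's outputs, and the dynamic threshold is modeled as a per-span similarity to that anchor.

```python
import numpy as np

def cosine_sim(a, b):
    """Row-wise cosine similarity between two sets of vectors."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

# Toy embeddings: 3 candidate spans, 2 entity types, dimension 4.
# In the real model these come from the two encoders of the bi-encoder.
rng = np.random.default_rng(0)
span_emb = rng.normal(size=(3, 4))
type_emb = rng.normal(size=(2, 4))
threshold_emb = rng.normal(size=(1, 4))  # hypothetical learned threshold anchor

sims = cosine_sim(span_emb, type_emb)         # (3, 2) span-type similarities
thresh = cosine_sim(span_emb, threshold_emb)  # (3, 1) per-span dynamic threshold

# A span is predicted as an entity of its best-matching type only when that
# similarity exceeds the span's own dynamic threshold; otherwise it is
# treated as non-entity, replacing an explicit Outside (O) class.
best_type = sims.argmax(axis=1)
is_entity = sims.max(axis=1) > thresh[:, 0]
```

At training time, the same similarities would feed a contrastive loss over types for entity spans, while the threshold term is learned jointly so that entity-span similarities are pushed above it and non-entity spans below it.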