We present a bi-encoder framework for named entity recognition (NER), which applies contrastive learning to map candidate text spans and entity types into the same vector representation space. Prior work predominantly approaches NER as sequence labeling or span classification. We instead frame NER as a representation learning problem that maximizes the similarity between the vector representations of an entity mention and its type. This formulation handles nested and flat NER alike and can better leverage noisy self-supervision signals. A major challenge in this bi-encoder formulation is separating non-entity spans from entity mentions. Instead of explicitly labeling all non-entity spans as the same class $\texttt{Outside}$ ($\texttt{O}$), as most prior methods do, we introduce a novel dynamic thresholding loss. Experiments show that our method performs well in both supervised and distantly supervised settings, for nested and flat NER alike, establishing a new state of the art across standard datasets in the general domain (e.g., ACE2004, ACE2005) and in high-value verticals such as biomedicine (e.g., GENIA, NCBI, BC5CDR, JNLPBA). We release the code at github.com/microsoft/binder.
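To make the formulation concrete, the sketch below illustrates the core idea of scoring span-type similarity with a contrastive loss and a learned threshold that stands in for the $\texttt{O}$ class. It is a minimal sketch in PyTorch, not the released implementation (see the repository above): the tensor shapes, the `temperature` value, and the use of a single learned threshold embedding are illustrative assumptions.

```python
# Minimal sketch of a bi-encoder NER objective with a dynamic threshold.
# Assumptions (not from the released code): span/type embeddings are
# precomputed by two encoders, and a single learned vector provides a
# per-span threshold score in place of an explicit Outside (O) class.
import torch
import torch.nn.functional as F

def bi_encoder_ner_loss(span_emb, type_emb, threshold_emb, gold_type,
                        temperature=0.07):
    """Contrastive loss over candidate spans.

    span_emb:      (num_spans, dim)  encoded candidate spans
    type_emb:      (num_types, dim)  encoded entity-type descriptions
    threshold_emb: (dim,)            learned anchor whose similarity to a
                                     span acts as that span's dynamic
                                     "no entity" threshold
    gold_type:     (num_spans,)      gold type index, or -1 for non-entity spans
    """
    span_emb = F.normalize(span_emb, dim=-1)
    type_emb = F.normalize(type_emb, dim=-1)
    threshold_emb = F.normalize(threshold_emb, dim=-1)

    # Cosine similarity between every span and every type, plus the
    # span-specific threshold score as an extra "column 0" logit.
    sim = span_emb @ type_emb.t() / temperature        # (num_spans, num_types)
    thr = (span_emb @ threshold_emb) / temperature     # (num_spans,)
    logits = torch.cat([thr.unsqueeze(1), sim], dim=1)

    # Entity spans must score their gold type above the threshold column;
    # non-entity spans (gold_type == -1) must score the threshold highest.
    target = gold_type + 1  # shift: -1 -> 0 (threshold), k -> k + 1
    return F.cross_entropy(logits, target)

# Toy usage with random tensors standing in for encoder outputs.
spans = torch.randn(5, 128)
types = torch.randn(3, 128)
thr_vec = torch.randn(128)
gold = torch.tensor([0, 2, -1, 1, -1])
loss = bi_encoder_ner_loss(spans, types, thr_vec, gold)
```

The design point this sketch tries to capture is that the threshold is dynamic: each span is compared against its own threshold score rather than a fixed global cutoff or an explicit $\texttt{O}$ classifier, which is what lets the same similarity space serve both entity typing and non-entity rejection.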