Real-world data usually exhibits a long-tailed distribution, with a few frequent labels and many few-shot labels. Institution name normalization is a prime application that exhibits this phenomenon: there are many institutions worldwide, and their names vary enormously in the publicly available literature. In this work, we first collect a large-scale institution name normalization dataset, LoT-insts, which contains over 25k classes that follow a naturally long-tailed distribution. To isolate the few-shot and zero-shot learning scenarios from the massive many-shot classes, we construct our test set from four different subsets: many-, medium-, and few-shot sets, as well as a zero-shot open set. We also replicate several important baseline methods on our data, covering a wide range from search-based methods to neural network methods that use a pretrained BERT model. Further, we propose a specially pretrained, BERT-based model that shows better out-of-distribution generalization on the few-shot and zero-shot test sets. Compared to other datasets focusing on the long-tailed phenomenon, our dataset has one order of magnitude more training data than the largest existing long-tailed datasets and is naturally long-tailed rather than manually synthesized. We believe it provides an important and distinct scenario for studying this problem. To the best of our knowledge, this is the first natural language dataset that focuses on long-tailed and open-set classification problems.
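The frequency-based split described above can be illustrated with a minimal sketch. The thresholds (>=100 training examples for many-shot, 20-99 for medium-shot, fewer than 20 for few-shot) and the helper name `split_classes_by_frequency` are illustrative assumptions, not the paper's actual cutoffs.

```python
from collections import Counter

# Illustrative sketch (assumed thresholds, not the paper's actual cutoffs):
# bucket each class by how many training examples it has, so that few-shot
# performance can be measured separately from many-shot performance.
def split_classes_by_frequency(train_labels, many_min=100, medium_min=20):
    counts = Counter(train_labels)
    buckets = {"many": set(), "medium": set(), "few": set()}
    for label, n in counts.items():
        if n >= many_min:
            buckets["many"].add(label)
        elif n >= medium_min:
            buckets["medium"].add(label)
        else:
            buckets["few"].add(label)
    return buckets

# Classes that never appear in training would form the zero-shot open set:
# open_set = set(test_labels) - set(train_labels)
```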