Term extraction is an information extraction task at the root of knowledge discovery platforms. Developing term extractors that generalize across very diverse and potentially highly technical domains is challenging, as annotations for domains requiring in-depth expertise are scarce and expensive to obtain. In this paper, we describe the term extraction subsystem of a commercial knowledge discovery platform that targets highly technical fields such as pharma, medicine, and material science. To generalize across domains, we introduce a fully unsupervised annotator (UA). It extracts terms by combining novel morphological signals from sub-word tokenization with term-to-topic and intra-term similarity metrics, computed using general-domain pre-trained sentence encoders. The annotator is used to implement a weakly supervised setup, where transformer models are fine-tuned (or pre-trained) on the training data generated by running the UA over large unlabeled corpora. Our experiments demonstrate that this setup improves predictive performance while decreasing inference latency on both CPUs and GPUs. Our annotators provide a very competitive baseline for cases where annotations are not available.
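The scoring idea behind the unsupervised annotator can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `embed` function here is a hypothetical character-trigram stand-in for the general-domain pre-trained sentence encoders the system actually uses, and the equal weighting of the two similarity signals is an assumption.

```python
import math
from collections import Counter

def embed(text):
    # Hypothetical stand-in for a pre-trained sentence encoder:
    # a sparse bag of character trigrams. The real system uses
    # dense sentence embeddings instead.
    padded = f"##{text.lower()}##"
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

def cosine(a, b):
    # Cosine similarity between two sparse feature vectors.
    dot = sum(v * b[k] for k, v in a.items() if k in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def score_term(candidate, topic):
    # Term-to-topic similarity: how close the candidate term is
    # to a representation of the document's topic.
    term_topic = cosine(embed(candidate), embed(topic))
    # Intra-term similarity: average pairwise similarity of the
    # words inside a multi-word candidate; coherent multi-word
    # terms score higher. Single words default to 1.0.
    words = candidate.split()
    if len(words) > 1:
        pairs = [(a, b) for i, a in enumerate(words) for b in words[i + 1:]]
        intra = sum(cosine(embed(a), embed(b)) for a, b in pairs) / len(pairs)
    else:
        intra = 1.0
    # Equal weighting of the two signals is an illustrative choice.
    return 0.5 * term_topic + 0.5 * intra
```

A candidate whose embedding matches the topic and whose components cohere scores near 1, while unrelated candidates score near 0; thresholding such a score is one way an unsupervised annotator can select terms without labeled data.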