We develop a language similarity model that works with patents and scientific publications at the same time. In a horse-race-style evaluation, we compare eight language (similarity) models on their ability to predict credible patent-paper citations. Our Pat-SPECTER model, which is the SPECTER2 model fine-tuned on patents, performs best. We demonstrate the capabilities of Pat-SPECTER in two real-world scenarios: separating patent-paper pairs and predicting patent-paper pairs. Finally, we test the hypothesis that US patents cite papers that are semantically less similar to the citing patent than patents in other large jurisdictions do, which we posit is a consequence of the duty of candor. The model is openly available to the academic community and practitioners alike.