The network of patents citing prior art arises from the legal obligation of patent applicants to properly disclose their inventions. One way to study the relationship between current patents and their antecedents is to analyze the similarity between their textual elements. Many patent similarity indicators have shown a steady decrease since the mid-1970s. Although several explanations have been proposed, comprehensive analyses of this phenomenon have been rare. In this paper, we use a computationally efficient measure of patent similarity that leverages state-of-the-art Natural Language Processing tools to investigate potential drivers of this apparent decrease. We do so by modeling patent similarity scores with generalized additive models. We find that non-linear model specifications can distinguish between distinct, temporally varying drivers of patent similarity levels and explain more variation in the data ($R^2\sim 18\%$) than previous methods. Moreover, the model reveals an underlying trend in similarity scores that is fundamentally different from the one presented previously.
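To make the notion of a textual similarity score between a patent and its cited antecedent concrete, the following is a minimal sketch. It assumes a plain bag-of-words cosine similarity, which is a stand-in chosen for illustration and not the NLP-based measure used in the paper; the function names and the example patent texts are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between bag-of-words count vectors of two texts.

    Assumption: raw term counts stand in for whatever text representation
    a real patent-similarity pipeline would use.
    """
    counts_a = Counter(tokenize(text_a))
    counts_b = Counter(tokenize(text_b))
    # Only tokens present in both texts contribute to the dot product.
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[t] * counts_b[t] for t in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty text is defined here as dissimilar to everything
    return dot / (norm_a * norm_b)


# Hypothetical texts for a citing patent, a cited patent, and an unrelated one.
citing = "A rechargeable battery with a lithium anode and a solid electrolyte"
cited = "A lithium battery comprising a solid polymer electrolyte"
unrelated = "Method for brewing coffee using pressurized steam"

score_related = cosine_similarity(citing, cited)
score_unrelated = cosine_similarity(citing, unrelated)
```

Scores of this kind, computed for each citing-cited pair, would form the response variable that an additive model then relates to filing year and other covariates.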