The Harvard USPTO Patent Dataset: A Large-Scale, Well-Structured, and Multi-Purpose Corpus of Patent Applications

Source: arXiv. Website: https://patentdataset.org/, GitHub Repository: https://github.com/suzgunmirac/hupd, Hugging Face Datasets: https://huggingface.co/datasets/HUPD/hupd

Innovation is a major driver of economic and social development, and information about many kinds of innovation is embedded in semi-structured data from patents and patent applications. Although the impact and novelty of innovations expressed in patent data are difficult to measure through traditional means, ML offers a promising set of techniques for evaluating novelty, summarizing contributions, and embedding semantics. In this paper, we introduce the Harvard USPTO Patent Dataset (HUPD), a large-scale, well-structured, and multi-purpose corpus of English-language patent applications filed to the United States Patent and Trademark Office (USPTO) between 2004 and 2018. With more than 4.5 million patent documents, HUPD is two to three times larger than comparable corpora. Unlike previously proposed patent datasets in NLP, HUPD contains the inventor-submitted versions of patent applications--not the final versions of granted patents--thereby allowing us to study patentability at the time of filing using NLP methods for the first time. It is also novel in its inclusion of rich structured metadata alongside the text of patent filings: By providing each application's metadata along with all of its text fields, the dataset enables researchers to perform new sets of NLP tasks that leverage variation in structured covariates. As a case study on the types of research HUPD makes possible, we introduce a new task to the NLP community--namely, binary classification of patent decisions. We additionally show the structured metadata provided in the dataset enables us to conduct explicit studies of concept shifts for this task. Finally, we demonstrate how HUPD can be used for three additional tasks: multi-class classification of patent subject areas, language modeling, and summarization.
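To make the binary decision-classification task and the concept-shift study more concrete, here is a minimal sketch of how such an experiment could be set up. It is not the authors' pipeline: the toy records, the field names (`abstract`, `decision`, `filing_date`), the label strings (`ACCEPTED`, `REJECTED`, `PENDING`), and both helper functions are assumptions made for illustration only.

```python
from datetime import date

# Hypothetical toy records standing in for patent applications.
# Field names and label strings are assumptions, not taken from the abstract.
APPLICATIONS = [
    {"abstract": "toy text A", "decision": "ACCEPTED", "filing_date": date(2011, 3, 1)},
    {"abstract": "toy text B", "decision": "REJECTED", "filing_date": date(2012, 6, 15)},
    {"abstract": "toy text C", "decision": "PENDING", "filing_date": date(2013, 1, 9)},
    {"abstract": "toy text D", "decision": "ACCEPTED", "filing_date": date(2016, 4, 2)},
    {"abstract": "toy text E", "decision": "REJECTED", "filing_date": date(2017, 8, 30)},
]


def to_binary_examples(records):
    """Keep only decided applications and map each decision to 1 or 0."""
    label_map = {"ACCEPTED": 1, "REJECTED": 0}
    return [
        (r["abstract"], label_map[r["decision"]], r["filing_date"])
        for r in records
        if r["decision"] in label_map  # drops undecided (e.g. pending) filings
    ]


def split_by_filing_year(examples, cutoff_year):
    """Train on filings before cutoff_year and evaluate on later ones.

    Splitting on a structured covariate such as the filing date is one
    simple way to probe how a classifier behaves under concept shift.
    """
    train = [(text, label) for text, label, filed in examples if filed.year < cutoff_year]
    test = [(text, label) for text, label, filed in examples if filed.year >= cutoff_year]
    return train, test


examples = to_binary_examples(APPLICATIONS)
train_set, test_set = split_by_filing_year(examples, cutoff_year=2015)
```

Any text classifier could then be fit on `train_set` and scored on `test_set`. Comparing that score with one from a random split of the same examples gives a rough indication of how much the task changes over time.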
