Generative models, such as GPT-2, have recently demonstrated impressive results. A fundamental question we would like to address is: where did the generated text come from? This work is our initial effort toward answering that question through prior art search. The purpose of the prior art search is to find the most similar prior text in the training data of GPT-2. We take a reranking approach and apply it to the patent domain. Specifically, we pre-train GPT-2 models from scratch using patent data from the USPTO. The input for the prior art search is the patent text generated by the GPT-2 model. We also pre-train BERT models from scratch to convert patent text into embeddings. The reranking steps are: (1) search for the most similar text in the training data of GPT-2 using a bag-of-words ranking approach (BM25), (2) convert the retrieved text into BERT embeddings, and (3) produce the final result by ranking the BERT embeddings based on their similarities to the patent text generated by GPT-2. The experiments in this work show that such reranking outperforms ranking with embeddings alone. However, our mixed results also indicate that calculating semantic similarities among long text spans remains challenging. To our knowledge, this work is the first to implement a reranking system that retrospectively identifies the most similar inputs to a GPT model based on its output.
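To make the three-step pipeline concrete, the following is a minimal sketch, not the paper's implementation: it assumes the rank_bm25 package for BM25 retrieval, a Hugging Face Transformers interface to a pre-trained BERT model, and mean pooling over the last hidden states as the embedding function. The checkpoint path "path/to/patent-bert" and the helper names are hypothetical placeholders.

```python
import numpy as np
import torch
from rank_bm25 import BM25Okapi
from transformers import AutoModel, AutoTokenizer

# Hypothetical checkpoint path; in the paper, BERT is pre-trained from scratch on patent text.
tokenizer = AutoTokenizer.from_pretrained("path/to/patent-bert")
model = AutoModel.from_pretrained("path/to/patent-bert")

def embed(text: str) -> np.ndarray:
    """Mean-pool the last hidden states into one L2-normalized vector (one possible choice)."""
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # shape: (1, seq_len, dim)
    vec = hidden.mean(dim=1).squeeze(0).numpy()
    return vec / np.linalg.norm(vec)

def rerank(generated_text: str, training_corpus: list[str], top_k: int = 100) -> list[str]:
    # Step 1: BM25 retrieves the top_k most similar passages from the GPT-2 training data.
    bm25 = BM25Okapi([doc.split() for doc in training_corpus])
    candidates = bm25.get_top_n(generated_text.split(), training_corpus, n=top_k)
    # Steps 2-3: embed the candidates with BERT and rank them by cosine similarity
    # to the embedding of the GPT-2-generated patent text.
    query_vec = embed(generated_text)
    scored = [(float(np.dot(query_vec, embed(doc))), doc) for doc in candidates]
    return [doc for _, doc in sorted(scored, reverse=True)]
```

The two-stage design reflects the reranking idea: a cheap lexical pass (BM25) narrows the training data to a candidate set, and the more expensive embedding similarity only reorders those candidates.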