In this paper, we introduce a reproducible process for cleaning text extracted from PDFs using n-gram models. Our approach compares the originally extracted text with the text that these models generate, or expect, when given the preceding text as a stimulus. To guide this process, we introduce a consistency score: the proportion of the text that the model expects. We use this score to monitor changes during the cleaning process and to compare across corpora. We illustrate our process on text from the novel Jane Eyre, and we introduce both a Shiny application and an R package to make the process easier for others to adopt.
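The consistency score can be illustrated with a minimal sketch. This is not the authors' implementation or the API of their R package. It assumes a plain count-based trigram model, whitespace tokenization, and a reading of "expected" as "the token is the model's most frequent continuation of the two preceding tokens". The function names `train_ngram` and `consistency_score`, and the parameters `n` and `top_k`, are hypothetical.

```python
from collections import Counter, defaultdict


def train_ngram(tokens, n=3):
    """Map each (n-1)-token context to a Counter of the tokens that followed it."""
    model = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        model[context][tokens[i + n - 1]] += 1
    return model


def consistency_score(tokens, model, n=3, top_k=1):
    """Proportion of scorable tokens that are among the model's top_k
    continuations of the preceding n-1 tokens (assumed definition)."""
    expected = 0
    total = 0
    # The first n-1 tokens have no full context, so they are not scored.
    for i in range(n - 1, len(tokens)):
        context = tuple(tokens[i - n + 1:i])
        candidates = [w for w, _ in model.get(context, Counter()).most_common(top_k)]
        total += 1
        if tokens[i] in candidates:
            expected += 1
    return expected / total if total else 0.0


# Opening sentence of Jane Eyre as reference text, and a copy with one
# simulated extraction error ("wa1k" for "walk").
reference = "there was no possibility of taking a walk that day".split()
extracted = "there was no possibility of taking a wa1k that day".split()

model = train_ngram(reference, n=3)
print(consistency_score(reference, model))  # 1.0
print(consistency_score(extracted, model))  # 0.625
```

In this toy case a single corrupted token lowers the score from 1.0 to 0.625, because it also breaks the context for the two tokens that follow it. Tracking how the score moves after each cleaning step is the kind of monitoring the abstract describes.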