Extracting metadata from scientific papers can be considered a solved problem in NLP, given the high accuracy of state-of-the-art methods. However, this does not hold for German scientific publications, which exhibit a wide variety of styles and layouts. In contrast to most English scientific publications, which follow standard and simple layouts, the order, content, position, and size of metadata in German publications vary greatly across publications. This variety causes traditional NLP methods to fail to accurately extract metadata from these publications. In this paper, we present a method that extracts metadata from PDF documents with different layouts and styles by treating each document as an image. We use Mask R-CNN, pre-trained on the COCO dataset and fine-tuned on the PubLayNet dataset, which consists of ~200K PDF page snapshots annotated with five basic classes (e.g. text, figure). We then further fine-tuned the model on our proposed synthetic dataset, consisting of ~30K article snapshots, to extract nine metadata classes (e.g. author, title). Our synthetic dataset is generated from content in both German and English, using a finite set of challenging templates obtained from German publications. Our method achieves an average accuracy of around $90\%$, which validates its capability to accurately extract metadata from a variety of PDF documents with challenging templates.
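To illustrate how the detector's output might be assembled into a metadata record, here is a minimal sketch of a plausible post-processing step. The label set, the detection format, and the highest-confidence selection rule are assumptions for illustration, not the paper's actual implementation:

```python
# Hypothetical post-processing: keep the highest-scoring detected region
# for each metadata class predicted by the fine-tuned Mask R-CNN.
# The nine-class label set below is assumed, not taken from the paper.
METADATA_CLASSES = {"title", "author", "abstract", "journal", "affiliation",
                    "address", "email", "date", "doi"}

def assemble_metadata(detections):
    """detections: list of (label, confidence, text) triples, one per
    detected region after OCR/text extraction. Returns one value per field."""
    best = {}
    for label, score, text in detections:
        if label in METADATA_CLASSES and score > best.get(label, (0.0, ""))[0]:
            best[label] = (score, text)
    return {label: text for label, (score, text) in best.items()}

detections = [
    ("title", 0.97, "Metadata Extraction from German Scientific Papers"),
    ("author", 0.91, "M. Mustermann"),
    ("title", 0.40, "Zusammenfassung"),  # lower-scoring duplicate is discarded
    ("figure", 0.99, ""),                # non-metadata classes are ignored
]
print(assemble_metadata(detections))
```

A per-field confidence threshold or layout-based tie-breaking could replace the simple argmax rule, depending on how noisy the detections are.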