Scientific articles published before the "age of digitization" in the late 1990s contain figures that are "trapped" within their scanned pages. Although progress has been made in extracting figures and their captions, no robust method for this task currently exists. We present a YOLO-based method for scanned pages that have already been processed with Optical Character Recognition (OCR); it uses both grayscale and OCR features. We focus on translating the intersection-over-union (IOU) metric from the field of object detection to document layout analysis, and we quantify "high localization" as an IOU of 0.9. When applied to the astrophysics literature holdings of the NASA Astrophysics Data System (ADS), our method achieves F1 scores of 90.9% (92.2%) for figures (figure captions) at an IOU cut-off of 0.9, a significant improvement over other state-of-the-art methods.
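
To make the IOU cut-off concrete, the following is a minimal sketch of how an IOU between two bounding boxes and an F1 score at a fixed IOU threshold could be computed. The box format `(x_min, y_min, x_max, y_max)`, the function names `iou` and `f1_at_iou`, and the greedy one-to-one matching are assumptions made for illustration; they are not necessarily the scoring procedure used for the results quoted above.

```python
from typing import List, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in page coordinates.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def f1_at_iou(pred: List[Box], truth: List[Box], threshold: float = 0.9) -> float:
    """F1 score where a prediction is a true positive only if it overlaps
    a not-yet-matched ground-truth box with IOU >= threshold.
    Matching is greedy (an illustrative simplification)."""
    unmatched = list(truth)
    tp = 0
    for p in pred:
        best = max(unmatched, key=lambda t: iou(p, t), default=None)
        if best is not None and iou(p, best) >= threshold:
            tp += 1
            unmatched.remove(best)  # each true box can be matched once
    fp = len(pred) - tp
    fn = len(truth) - tp
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0
```

With a threshold of 0.9, a predicted figure box that covers only half of the true figure counts as both a false positive and a false negative, which is why a high cut-off is a demanding test of localization.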