The lack of generalizability, in which a model trained on one dataset cannot produce accurate results on a different dataset, is a known problem in the field of document layout analysis. Thus, when a model is used to locate important page objects in scientific literature such as figures, tables, captions, and math formulas, it often cannot be applied successfully to new domains. While several solutions have been proposed, including newer deep learning models, larger hand-annotated datasets, and the generation of large synthetic datasets, so far there is no "magic bullet" for translating a model trained on a particular domain or historical time period to a new field. Here we present our ongoing work in translating our document layout analysis model from the historical astrophysical literature to the larger corpus of scientific documents within the HathiTrust U.S. Federal Documents collection. We use this example to highlight some of the problems with generalizability in the document layout analysis community and discuss several challenges and possible solutions to address these issues. All code for this work is available in The Reading Time Machine GitHub repository (https://github.com/ReadingTimeMachine/htrc_short_conf).