Converting data from machine-unreadable formats such as PDF into Markdown has the potential to make scientific research more accessible. Existing end-to-end decoder Transformer models can convert screenshots of PDF pages into Markdown and offer more flexibility than pipeline-based methods. However, decoding text token by token from scratch is inefficient, especially when dense text could be copied directly from the PDF. To address this inefficiency, this paper adapts Prompt Lookup Decoding (PLD) to extract candidate sequences directly from PDF files, exploiting the high n-gram overlap between PDFs and their Markdown equivalents. It also introduces a new method, Copy Lookup Decoding (CLD), which improves PLD's candidate generation mechanism. Experiments show that CLD accelerates conversion by up to 1.70$\times$ while preserving the original output quality. The code for this paper is open source on GitHub (https://github.com/Fireblossom/CopyLookup).
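To make the lookup idea concrete, the following is a minimal sketch of PLD-style candidate generation when the lookup source is text taken from the PDF rather than the prompt. It is not the paper's CLD implementation: the function names `lookup_candidates` and `accept_prefix`, the parameters `max_ngram` and `num_candidates`, and the use of whitespace-split words in place of model tokens are all assumptions made for illustration.

```python
from typing import List, Sequence


def lookup_candidates(
    generated: Sequence[str],
    source: Sequence[str],
    max_ngram: int = 3,        # hypothetical parameter: longest suffix to match
    num_candidates: int = 5,   # hypothetical parameter: draft length
) -> List[str]:
    """Propose draft tokens by matching the tail of `generated` in `source`.

    Tries the longest suffix first and returns the tokens that follow the
    first match found in `source`, or an empty list if nothing matches.
    """
    for n in range(min(max_ngram, len(generated)), 0, -1):
        tail = list(generated[-n:])
        for start in range(len(source) - n + 1):
            if list(source[start:start + n]) == tail:
                follow = list(source[start + n:start + n + num_candidates])
                if follow:
                    return follow
    return []


def accept_prefix(candidates: Sequence[str], model_tokens: Sequence[str]) -> List[str]:
    """Keep the drafted tokens the model agrees with, plus one model token.

    `model_tokens` stands in for the model's own greedy predictions at each
    drafted position (obtained in one forward pass in a real decoder).
    """
    accepted: List[str] = []
    for cand, pred in zip(candidates, model_tokens):
        if cand != pred:
            break
        accepted.append(cand)
    # The model's prediction at the first disagreement is always usable.
    if len(accepted) < len(model_tokens):
        accepted.append(model_tokens[len(accepted)])
    return accepted


# Words stand in for tokens; `pdf_text` plays the role of text copied from the PDF.
pdf_text = "copy lookup decoding extracts candidate sequences from the pdf".split()
output_so_far = "the model says copy lookup decoding".split()

draft = lookup_candidates(output_so_far, pdf_text, max_ngram=3, num_candidates=4)
step = accept_prefix(draft, ["extracts", "candidate", "tokens", "x", "y"])
```

Because only drafted tokens that match the model's own predictions are kept, the output is the same as ordinary greedy decoding; the speed-up comes from accepting several tokens per forward pass whenever the PDF text and the Markdown output overlap.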