This paper proposes a new method, OFA-OCR, for transferring multimodal pretrained models to text recognition. Specifically, we recast text recognition as image captioning and directly transfer a unified vision-language pretrained model to the end task. Without pretraining on large-scale annotated or synthetic text recognition data, OFA-OCR outperforms the baselines and achieves state-of-the-art performance on the Chinese text recognition benchmark. Additionally, we construct an OCR pipeline with OFA-OCR and demonstrate that it achieves performance competitive with a product-level API. The code (https://github.com/OFA-Sys/OFA) and demo (https://modelscope.cn/studios/damo/ofa_ocr_pipeline/summary) are publicly available.