Visual document understanding is a complex task that involves analyzing both the text and the visual elements in document images. Existing models often rely on manual feature engineering or domain-specific pipelines, which limit their ability to generalize across document types and languages. In this paper, we propose DUBLIN, which is pretrained on webpages using three novel objectives that leverage the spatial and semantic information in document images: the Masked Document Content Generation Task, the Bounding Box Task, and the Rendered Question Answering Task. We evaluate our model on several benchmarks covering Web-Based Structural Reading Comprehension, Document Visual Question Answering, Key Information Extraction, Diagram Understanding, and Table Question Answering, and show that it achieves results competitive with or better than state-of-the-art (SOTA) models on these tasks. In particular, DUBLIN is the first pixel-based model to achieve an EM of 77.75 and an F1 of 84.25 on the WebSRC dataset. Our model also outperforms the current pixel-based SOTA models on the DocVQA and AI2D datasets by significant margins, with performance increases of 2% and 21%, respectively. DUBLIN is also the first pixel-based model to achieve performance comparable to text-based SOTA methods on the XFUND dataset for Semantic Entity Recognition, showcasing its multilingual capability. Moreover, we create new baselines for text-based datasets by rendering them as document images and applying our model.