Recent advances in self-supervised learning (SSL) using large models to learn visual representations from natural images are rapidly closing the gap between the results produced by fully supervised learning and those produced by SSL on downstream vision tasks. Inspired by this advancement, and primarily motivated by the emergence of tabular and structured document image applications, we investigate which self-supervised pretraining objectives, architectures, and fine-tuning strategies are most effective. To address these questions, we introduce RegCLR, a new self-supervised framework that combines contrastive and regularized methods and is compatible with the standard Vision Transformer architecture. RegCLR is then instantiated by integrating masked autoencoders as a representative example of a contrastive method and enhanced Barlow Twins as a representative example of a regularized method, with configurable input image augmentations in both branches. Several real-world table recognition scenarios (e.g., extracting tables from document images), ranging from standard Word and LaTeX documents to the even more challenging setting of electronic health record (EHR) computer screen images, benefit greatly from the representations learned with this new framework: detection average precision (AP) improves by a relative 4.8% for Table, 11.8% for Column, and 11.1% for GUI objects over a previous fully supervised baseline on real-world EHR screen images.
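To make the two-branch objective described above concrete, the following is a minimal, illustrative PyTorch sketch of how an MAE-style masked-reconstruction loss and a Barlow Twins-style redundancy-reduction loss could be combined into a single training objective. This is not the authors' implementation: all module names, shapes, and hyperparameters (TwoBranchSSL, lambda_offdiag, alpha, and the toy linear encoder standing in for a Vision Transformer) are assumptions made purely for illustration.

```python
# Minimal, illustrative sketch (not the authors' implementation). All names,
# shapes, and hyperparameters below are assumptions; the toy linear encoder
# stands in for the Vision Transformer used in the paper.
import torch
import torch.nn as nn


def barlow_twins_loss(z1, z2, lambda_offdiag=5e-3, eps=1e-6):
    """Redundancy-reduction loss on the cross-correlation of two embedded views."""
    n, _ = z1.shape
    z1 = (z1 - z1.mean(0)) / (z1.std(0) + eps)   # standardize each dimension
    z2 = (z2 - z2.mean(0)) / (z2.std(0) + eps)
    c = (z1.T @ z2) / n                          # d x d cross-correlation matrix
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()
    off_diag = (c - torch.diag_embed(torch.diagonal(c))).pow(2).sum()
    return on_diag + lambda_offdiag * off_diag


def mae_reconstruction_loss(pred, target, mask):
    """Mean-squared error computed only on the masked patches (mask == 1)."""
    per_patch = (pred - target).pow(2).mean(dim=-1)          # (B, N)
    return (per_patch * mask).sum() / mask.sum().clamp(min=1)


class TwoBranchSSL(nn.Module):
    """Shared encoder feeding a reconstruction head (MAE-style branch) and a
    projection head (Barlow Twins-style branch)."""

    def __init__(self, patch_dim=768, embed_dim=256, proj_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(patch_dim, embed_dim), nn.GELU())
        self.decoder = nn.Linear(embed_dim, patch_dim)    # reconstructs patches
        self.projector = nn.Linear(embed_dim, proj_dim)   # embeds views

    def forward(self, patches, mask, view1, view2, alpha=1.0):
        # MAE branch: zero out masked patches, reconstruct, score masked ones only.
        latent = self.encoder(patches * (1 - mask).unsqueeze(-1))
        loss_mae = mae_reconstruction_loss(self.decoder(latent), patches, mask)

        # Barlow Twins branch: pooled embeddings of two augmented views.
        z1 = self.projector(self.encoder(view1).mean(dim=1))
        z2 = self.projector(self.encoder(view2).mean(dim=1))
        loss_bt = barlow_twins_loss(z1, z2)

        return loss_mae + alpha * loss_bt


if __name__ == "__main__":
    model = TwoBranchSSL()
    patches = torch.randn(8, 196, 768)                    # 8 images, 196 patches
    mask = (torch.rand(8, 196) < 0.75).float()            # mask ~75% of patches
    view1, view2 = torch.randn_like(patches), torch.randn_like(patches)
    print(model(patches, mask, view1, view2).item())
```

In the full setup described in the abstract, the encoder would be a standard Vision Transformer and each branch would receive its own configurable image augmentations; the sketch only shows one plausible way the two losses could be summed into a single objective.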