Self-supervised pretraining has been able to produce transferable representations for various visual document understanding (VDU) tasks. However, the ability of such representations to adapt to new distribution shifts at test time has not yet been studied. We propose DocTTA, a novel test-time adaptation approach for documents that leverages cross-modality self-supervised learning via masked visual language modeling as well as pseudo labeling to adapt models learned on a \textit{source} domain to an unlabeled \textit{target} domain at test time. We also introduce new benchmarks using existing public datasets for various VDU tasks, including entity recognition, key-value extraction, and document visual question answering, on which DocTTA improves the source model performance by up to 1.79\% (F1 score), 3.43\% (F1 score), and 17.68\% (ANLS score), respectively, while drastically reducing calibration error on target data.
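To make the adaptation objective concrete, below is a minimal sketch (not the authors' released code) of a test-time adaptation step in the spirit described above: a masked visual language modeling loss on an unlabeled target batch combined with a cross-entropy loss on confident pseudo-labels. The model, dimensions, thresholds, and names (\texttt{ToyDocModel}, \texttt{tta\_step}, \texttt{MASK\_PROB}, \texttt{CONF\_THRESH}) are illustrative assumptions, not DocTTA's actual architecture or hyperparameters.

\begin{verbatim}
# Minimal sketch of test-time adaptation combining masked-token
# reconstruction (MVLM-style) with pseudo-label cross-entropy.
# All names and constants here are illustrative, not DocTTA's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, HIDDEN, NUM_CLASSES = 1000, 64, 5
MASK_ID, MASK_PROB, CONF_THRESH = 0, 0.15, 0.9

class ToyDocModel(nn.Module):
    """Stand-in for a source-pretrained VDU encoder with two heads."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, HIDDEN)
        self.encoder = nn.Sequential(nn.Linear(HIDDEN, HIDDEN), nn.ReLU())
        self.mlm_head = nn.Linear(HIDDEN, VOCAB)       # masked-token reconstruction
        self.cls_head = nn.Linear(HIDDEN, NUM_CLASSES) # e.g. per-token entity tags

    def forward(self, tokens):
        h = self.encoder(self.embed(tokens))
        return self.mlm_head(h), self.cls_head(h)

def tta_step(model, optimizer, tokens):
    """One adaptation step on an unlabeled target batch of token ids."""
    # 1) Masked visual language modeling: mask tokens, reconstruct them.
    mask = torch.rand_like(tokens, dtype=torch.float) < MASK_PROB
    masked = tokens.masked_fill(mask, MASK_ID)
    mlm_logits, cls_logits = model(masked)
    if mask.any():
        mlm_loss = F.cross_entropy(mlm_logits[mask], tokens[mask])
    else:
        mlm_loss = mlm_logits.sum() * 0.0  # no tokens masked this step

    # 2) Pseudo labeling: keep only confident predictions as targets.
    with torch.no_grad():
        probs = F.softmax(model(tokens)[1], dim=-1)
        conf, pseudo = probs.max(dim=-1)
        keep = conf > CONF_THRESH
    if keep.any():
        pl_loss = F.cross_entropy(cls_logits[keep], pseudo[keep])
    else:
        pl_loss = cls_logits.sum() * 0.0  # no confident pseudo-labels yet

    loss = mlm_loss + pl_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage on a random "target" batch (batch=4, seq_len=32).
model = ToyDocModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
batch = torch.randint(1, VOCAB, (4, 32))
print(tta_step(model, opt, batch))
\end{verbatim}

The sketch only shows the shape of the two-loss objective; in practice the encoder would be a multimodal document model taking text, layout, and image inputs, and the confidence threshold and masking ratio would be tuned per task.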