ClapperText: A Benchmark for Text Recognition in Low-Resource Archival Documents

This paper presents ClapperText, a benchmark dataset for handwritten and printed text recognition in visually degraded and low-resource settings. The dataset is derived from 127 World War II-era archival video segments containing clapperboards that record structured production metadata such as date, location, and camera-operator identity. ClapperText includes 9,813 annotated frames and 94,573 word-level text instances, of which 67% are handwritten and 1,566 are partially occluded. Each instance is annotated with its transcription, semantic category, text type, and occlusion status, and its location is given as a rotated bounding box represented as a 4-point polygon to support spatially precise OCR applications. Recognizing clapperboard text poses significant challenges, including motion blur, handwriting variation, exposure fluctuations, and cluttered backgrounds. These mirror broader challenges in historical document analysis, where structured content appears in degraded, non-standard forms. We provide both full-frame annotations and cropped word images to support downstream tasks. Using a consistent per-video evaluation protocol, we benchmark six representative recognition models and seven detection models under zero-shot and fine-tuned conditions. Despite the small training set (18 videos), fine-tuning leads to substantial performance gains, highlighting ClapperText's suitability for few-shot learning scenarios. The dataset offers a realistic and culturally grounded resource for advancing robust OCR and document understanding in low-resource archival contexts. The dataset and evaluation code are available at https://github.com/linty5/ClapperText.
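To make the annotation fields and the per-video protocol concrete, here is a minimal Python sketch. It is not the dataset's actual schema or the authors' evaluation code: the class name `WordInstance`, all field names, and the helper functions are hypothetical. It also assumes that "per-video evaluation" means scoring each video separately and then averaging across videos, and that recognition is scored by exact-match word accuracy. The abstract confirms neither detail, so consult the repository for the real format and metrics.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple

Point = Tuple[float, float]


@dataclass
class WordInstance:
    """Hypothetical record for one word-level annotation (field names assumed)."""
    video_id: str
    polygon: Tuple[Point, Point, Point, Point]  # rotated box as 4 corner points
    transcription: str
    semantic_category: str  # e.g. "date" or "location" (illustrative values)
    text_type: str          # "handwritten" or "printed"
    occluded: bool


def polygon_area(polygon: Tuple[Point, ...]) -> float:
    """Area of a simple polygon via the shoelace formula."""
    total = 0.0
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def per_video_accuracy(instances: List[WordInstance],
                       predictions: List[str]) -> Dict[str, float]:
    """Exact-match word accuracy computed separately for each video (assumed metric)."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for inst, pred in zip(instances, predictions):
        total[inst.video_id] += 1
        correct[inst.video_id] += int(pred == inst.transcription)
    return {vid: correct[vid] / total[vid] for vid in total}


def macro_average(scores: Dict[str, float]) -> float:
    """Unweighted mean over videos, so long videos do not dominate the score."""
    return sum(scores.values()) / len(scores)
```

Averaging per video rather than per instance keeps a video with many annotated frames from outweighing a short one, which is one plausible motivation for a per-video protocol.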

