Self-paced learning to improve text row detection in historical documents with missing labels

An important preliminary step of optical character recognition systems is the detection of text rows. To address this task in the context of historical data with missing labels, we propose a self-paced learning algorithm capable of improving the row detection performance. We conjecture that pages with more ground-truth bounding boxes are less likely to have missing annotations. Based on this hypothesis, we sort the training examples in descending order with respect to the number of ground-truth bounding boxes, and organize them into k batches. Using our self-paced learning method, we train a row detector over k iterations, progressively adding batches with fewer ground-truth annotations. At each iteration, we combine the ground-truth bounding boxes with pseudo-bounding boxes (bounding boxes predicted by the model itself) using non-maximum suppression, and we include the resulting annotations in the next training iteration. We demonstrate that our self-paced learning strategy brings significant performance gains on two data sets of historical documents, improving the average precision of YOLOv4 by more than 12% on one data set and 39% on the other.
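
The procedure described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the page layout (a dict with hypothetical `id` and `gt` keys), the `train` and `predict` callables, the IoU threshold of 0.5, the score of 1.0 assigned to ground-truth boxes, and the choice to re-predict on all pages seen so far are assumptions made only for illustration.

```python
from typing import Any, Callable, Dict, List, Optional, Sequence, Tuple

GtBox = Tuple[float, float, float, float]             # (x1, y1, x2, y2)
ScoredBox = Tuple[float, float, float, float, float]  # (x1, y1, x2, y2, score)
Page = Dict[str, Any]  # hypothetical layout: {"id": str, "gt": List[GtBox]}


def iou(a: Sequence[float], b: Sequence[float]) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: Sequence[ScoredBox], iou_threshold: float = 0.5) -> List[ScoredBox]:
    """Greedy non-maximum suppression: keep the highest-scoring box of each overlap group."""
    kept: List[ScoredBox] = []
    for box in sorted(boxes, key=lambda b: b[4], reverse=True):
        if all(iou(box, k) <= iou_threshold for k in kept):
            kept.append(box)
    return kept


def merge_annotations(gt: Sequence[GtBox], pseudo: Sequence[ScoredBox],
                      iou_threshold: float = 0.5) -> List[ScoredBox]:
    """Combine ground-truth boxes with pseudo-boxes via NMS.

    Assumption: ground-truth boxes get score 1.0, so they win over any
    overlapping prediction.
    """
    scored_gt = [(x1, y1, x2, y2, 1.0) for (x1, y1, x2, y2) in gt]
    return nms(scored_gt + list(pseudo), iou_threshold)


def make_batches(pages: Sequence[Page], k: int) -> List[List[Page]]:
    """Sort pages by descending ground-truth box count and split them into k batches."""
    ordered = sorted(pages, key=lambda p: len(p["gt"]), reverse=True)
    n = len(ordered)
    bounds = [i * n // k for i in range(k + 1)]
    return [ordered[bounds[i]:bounds[i + 1]] for i in range(k)]


def self_paced_training(
    pages: Sequence[Page],
    k: int,
    train: Callable[[List[Tuple[Page, List[ScoredBox]]]], Any],  # hypothetical trainer
    predict: Callable[[Any, Page], List[ScoredBox]],             # hypothetical detector
) -> Tuple[Optional[Any], Dict[str, List[ScoredBox]]]:
    """Train over k iterations, adding one batch per iteration."""
    annotations = {p["id"]: merge_annotations(p["gt"], []) for p in pages}
    active: List[Page] = []
    model = None
    for batch in make_batches(pages, k):
        active.extend(batch)  # pages with fewer annotations join later
        model = train([(p, annotations[p["id"]]) for p in active])
        # Assumption: pseudo-boxes are refreshed for every page seen so far
        # and used as annotations in the next iteration.
        for p in active:
            annotations[p["id"]] = merge_annotations(p["gt"], predict(model, p))
    return model, annotations
```

In practice, `train` would wrap the fitting of a detector such as YOLOv4 and `predict` would return its detections above some confidence threshold; both are left abstract here because the abstract does not specify them.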
