An important preliminary step of optical character recognition systems is the detection of text rows. To address this task in the context of historical data with missing labels, we propose a self-paced learning algorithm that improves row detection performance. We conjecture that pages with more ground-truth bounding boxes are less likely to have missing annotations. Based on this hypothesis, we sort the training examples in descending order of the number of ground-truth bounding boxes and organize them into k batches. Using our self-paced learning method, we train a row detector over k iterations, progressively adding batches with fewer ground-truth annotations. At each iteration, we combine the ground-truth bounding boxes with pseudo-bounding boxes (bounding boxes predicted by the model itself) using non-maximum suppression, and we include the resulting annotations in the next training iteration. We demonstrate that our self-paced learning strategy brings significant performance gains on two data sets of historical documents, improving the average precision of YOLOv4 by more than 12% on one data set and 39% on the other.
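The procedure summarized above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the `train` and `predict` callables standing in for a detector such as YOLOv4, the IoU threshold of 0.5, the score of 1.0 assigned to ground-truth boxes, and the choice to refresh annotations for the current pages plus the next batch are all assumptions made for illustration.

```python
from typing import Any, Callable, Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)
Scored = Tuple[Box, float]               # (box, confidence)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(scored: List[Scored], iou_thr: float = 0.5) -> List[Box]:
    """Greedy non-maximum suppression; the highest-scoring boxes win."""
    kept: List[Box] = []
    for box, _ in sorted(scored, key=lambda s: s[1], reverse=True):
        if all(iou(box, k) <= iou_thr for k in kept):
            kept.append(box)
    return kept


def merge_annotations(gt: List[Box], pseudo: List[Scored],
                      iou_thr: float = 0.5) -> List[Box]:
    # Assumption: ground-truth boxes get score 1.0 and are listed first,
    # so they take precedence over overlapping pseudo-boxes.
    return nms([(b, 1.0) for b in gt] + list(pseudo), iou_thr)


def split_into_batches(gt: Dict[str, List[Box]], k: int) -> List[List[str]]:
    """Sort page ids by ground-truth box count (descending), split into k batches."""
    ordered = sorted(gt, key=lambda p: len(gt[p]), reverse=True)
    base, extra = divmod(len(ordered), k)
    batches, start = [], 0
    for i in range(k):
        end = start + base + (1 if i < extra else 0)
        batches.append(ordered[start:end])
        start = end
    return batches


def self_paced_training(
    gt: Dict[str, List[Box]],
    k: int,
    train: Callable[[Dict[str, List[Box]]], Any],  # hypothetical trainer
    predict: Callable[[Any, str], List[Scored]],   # hypothetical predictor
    iou_thr: float = 0.5,
):
    annotations = {p: list(boxes) for p, boxes in gt.items()}
    batches = split_into_batches(gt, k)
    active: List[str] = []
    model = None
    for t in range(k):
        active = active + batches[t]  # add the next, less-annotated batch
        model = train({p: annotations[p] for p in active})
        # Assumption: refresh labels for pages used at the next iteration.
        upcoming = active + (batches[t + 1] if t + 1 < k else [])
        for p in upcoming:
            annotations[p] = merge_annotations(gt[p], predict(model, p), iou_thr)
    return model, annotations
```

In this sketch the detector is injected through the two callables, so the scheduling and merging logic can be exercised with stubs before plugging in a real training routine.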