Handwritten document image binarization is a challenging task because of the high diversity in the content, page style, and condition of the documents. Traditional thresholding methods fail to generalize to such challenging scenarios, whereas deep-learning-based methods generalize well but require large amounts of training data. Current datasets for handwritten document image binarization are limited in size and fail to represent several challenging real-world scenarios. To address this problem, we propose HDIB1M, a handwritten document image binarization dataset of 1M images. We also present a novel method used to generate this dataset. To show the effectiveness of our dataset, we train a deep learning model, UNetED, on it and evaluate its performance on other publicly available datasets. The dataset and the code will be made available to the community.