HDIB1M -- Handwritten Document Image Binarization 1 Million Dataset

Handwritten document image binarization is a challenging task because of the high diversity in the content, page style, and condition of the documents. Traditional thresholding methods fail to generalize to such challenging scenarios; deep-learning-based methods can generalize well but require a large amount of training data. Current datasets for handwritten document image binarization are limited in size and fail to represent several challenging real-world scenarios. To solve this problem, we propose HDIB1M, a handwritten document image binarization dataset of 1M images. We also present a novel method used to generate this dataset. To show the effectiveness of our dataset, we train a deep learning model, UNetED, on it and evaluate its performance on other publicly available datasets. The dataset and the code will be made available to the community.
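
For readers unfamiliar with the "traditional thresholding methods" mentioned above, the following is a minimal sketch of one such baseline: a single global threshold chosen by Otsu's method. The abstract does not name a specific method, so the choice of Otsu's method is an assumption, and the function names `otsu_threshold` and `binarize` are hypothetical. The input is assumed to be an 8-bit grayscale image with dark ink on brighter paper.

```python
import numpy as np


def otsu_threshold(gray: np.ndarray) -> int:
    """Return a global threshold for an 8-bit grayscale image (Otsu's method)."""
    # Normalised histogram over the 256 possible intensity levels.
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    prob = hist / hist.sum()
    levels = np.arange(256)

    omega = np.cumsum(prob)         # fraction of pixels with intensity <= t
    mu = np.cumsum(prob * levels)   # cumulative (unnormalised) mean up to t
    mu_total = mu[-1]               # mean intensity of the whole image

    # Between-class variance for every candidate t; it is undefined when
    # one of the two classes is empty, so those entries are set to zero.
    denom = omega * (1.0 - omega)
    valid = denom > 1e-12
    sigma_b = np.zeros(256)
    sigma_b[valid] = (mu_total * omega[valid] - mu[valid]) ** 2 / denom[valid]

    return int(np.argmax(sigma_b))


def binarize(gray: np.ndarray) -> np.ndarray:
    """Map pixels brighter than the threshold to 1 (paper), the rest to 0 (ink)."""
    t = otsu_threshold(gray)
    return (gray > t).astype(np.uint8)
```

Because a single threshold is applied to the whole page, uneven illumination, stains, or bleed-through can push ink and paper onto the wrong side of it, which is the kind of failure that motivates learned approaches trained on large datasets.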
