Oracle-MNIST: 用于衡量机器学习数值基准的现实图像数据集 (Oracle-MNIST: a Realistic Image Dataset for Benchmarking Machine Learning Algorithms)

We introduce the Oracle-MNIST dataset, comprising of 28$\times $28 grayscale images of 30,222 ancient characters from 10 categories, for benchmarking pattern classification, with particular challenges on image noise and distortion. The training set totally consists of 27,222 images, and the test set contains 300 images per class. Oracle-MNIST shares the same data format with the original MNIST dataset, allowing for direct compatibility with all existing classifiers and systems, but it constitutes a more challenging classification task than MNIST. The images of ancient characters suffer from 1) extremely serious and unique noises caused by three-thousand years of burial and aging and 2) dramatically variant writing styles by ancient Chinese, which all make them realistic for machine learning research. The dataset is freely available at https://github.com/wm-bupt/oracle-mnist.

翻译：我们引入了甲骨文-MNIST数据集,该数据集由28美元计时的28个灰色图像组成,包括来自10类的30 222个古代字符的28个灰色图像,用于基准模式分类,在图像噪声和扭曲方面特别具有挑战性。培训集完全由27 222个图像组成,测试集每类包含300个图像。甲骨文-MNIST与原有的MNIST数据集有着相同的数据格式,允许与所有现有的分类和系统直接兼容,但与MNIST相比,这是一个更具挑战性的分类任务。古代字符的图像有:1)由三年的埋葬和老化造成的极其严重和独特的噪音,2)由古代中国人造成的急剧变异的写作风格,这些风格都使机器学习研究具有现实性。数据集可以在https://github.com/wm-but/orcle-mnist-mnist免费查阅。

相关内容

Machine Learning

关注 2245

机器学习（Machine Learning）是一个研究计算学习方法的国际论坛。该杂志发表文章，报告广泛的学习方法应用于各种学习问题的实质性结果。该杂志的特色论文描述研究的问题和方法，应用研究和研究方法的问题。有关学习问题或方法的论文通过实证研究、理论分析或与心理现象的比较提供了坚实的支持。应用论文展示了如何应用学习方法来解决重要的应用问题。研究方法论文改进了机器学习的研究方法。所有的论文都以其他研究人员可以验证或复制的方式描述了支持证据。论文还详细说明了学习的组成部分，并讨论了关于知识表示和性能任务的假设。官网地址：http://dblp.uni-trier.de/db/journals/ml/

【干货书】深度学习合成数据，354页pdf，Synthetic Data for Deep Learning

专知会员服务

104+阅读 · 2022年2月10日

【干货书】真实机器学习，264页pdf，Real-World Machine Learning

专知会员服务

115+阅读 · 2020年4月5日

100+篇《自监督学习(Self-Supervised Learning)》论文最新合集

专知会员服务

167+阅读 · 2020年3月18日

图像分类技巧集，17页ppt《Bag of Tricks for Image Classification》

专知会员服务

96+阅读 · 2020年3月12日