We present a novel approach to optical character recognition (OCR) of Hangul, the Korean script. Because Hangul is a phonographic script, it can represent 11,172 different characters with only 52 graphemes, composing each character from a combination of graphemes. Since this many character classes can overwhelm the capacity of a neural network, existing OCR encoding methods pre-define a smaller set of frequently used characters as target classes. This design choice naturally compromises performance on characters in the long tail of the distribution. In this work, we demonstrate that grapheme encoding is not only efficient but also performant for Hangul OCR. Benchmark tests show that our approach resolves two main problems of Hangul OCR: class imbalance and target class selection.
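To make the idea of grapheme encoding concrete, the following is a minimal sketch of how a precomposed Hangul syllable can be split into grapheme indices using the standard Unicode arithmetic for the Hangul Syllables block (U+AC00 to U+D7A3). It is an illustration only, not the encoding used in this work: the function names `decompose` and `compose` are hypothetical, and the Unicode scheme counts 19 initial, 21 medial, and 28 final positions (with index 0 meaning "no final consonant"), which is a different way of counting than the 52 graphemes mentioned above. The product 19 × 21 × 28 gives the 11,172 characters.

```python
# Unicode layout of the precomposed Hangul Syllables block.
HANGUL_BASE = 0xAC00   # code point of the first syllable, "가"
NUM_INITIAL = 19       # initial consonant positions
NUM_MEDIAL = 21        # medial vowel positions
NUM_FINAL = 28         # final consonant positions; index 0 means "no final"
NUM_SYLLABLES = NUM_INITIAL * NUM_MEDIAL * NUM_FINAL  # 11,172


def decompose(ch: str) -> tuple[int, int, int]:
    """Split one precomposed syllable into (initial, medial, final) indices."""
    offset = ord(ch) - HANGUL_BASE
    if not 0 <= offset < NUM_SYLLABLES:
        raise ValueError(f"{ch!r} is not a precomposed Hangul syllable")
    initial, rest = divmod(offset, NUM_MEDIAL * NUM_FINAL)
    medial, final = divmod(rest, NUM_FINAL)
    return initial, medial, final


def compose(initial: int, medial: int, final: int) -> str:
    """Inverse of decompose: rebuild the syllable from its three indices."""
    return chr(HANGUL_BASE + (initial * NUM_MEDIAL + medial) * NUM_FINAL + final)


if __name__ == "__main__":
    for ch in "한글":
        parts = decompose(ch)
        # Round trip check: recomposing the indices yields the same syllable.
        assert compose(*parts) == ch
        print(ch, parts)
```

Under this scheme, a classifier could predict three small outputs (19, 21, and 28 classes) instead of one output over 11,172 classes, so every syllable is reachable without choosing a target character set in advance. Whether the approach described here factors the output in exactly this way is an assumption of the sketch.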