This paper presents a new method for denoising images using Haralick features and then segmenting the characters using artificial neural networks. The image is divided into kernels, each of which is converted to a GLCM (Gray Level Co-Occurrence Matrix). A Haralick feature generation function is then called on each GLCM, yielding an array of fourteen elements, one per feature. The Haralick values and the corresponding noise/text classification form a dictionary, which is then used to denoise the image through kernel comparison. Segmentation is the process of extracting characters from a document and can be used when letters are separated by white space, which is an explicit boundary marker. Segmentation is the first step in many Natural Language Processing problems. This paper explores the process of segmentation using neural networks. While numerous methods exist for segmenting the characters of a document, this paper is concerned only with the accuracy of doing so using neural networks. It is imperative that the characters be segmented correctly, because failing to do so leads to incorrect recognition by Natural Language Processing tools. Artificial neural networks attained an accuracy of up to 89%. This method is suitable for languages in which the characters are delimited by white space. However, it will fail to provide acceptable results when the language makes heavy use of connected letters. An example is the Devanagari script, which is predominantly used in northern India.
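The kernel-to-GLCM-to-feature step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names (`glcm`, `haralick_subset`, `kernel_features`), the 8x8 kernel size, the quantization to 8 gray levels, and the single horizontal offset are all assumptions, and only four of the fourteen Haralick features are computed (angular second moment, contrast, inverse difference moment, entropy).

```python
import numpy as np

def glcm(patch, levels=8, offset=(0, 1)):
    """Normalized, symmetric co-occurrence matrix for one non-negative offset."""
    patch = np.asarray(patch)
    dr, dc = offset
    rows, cols = patch.shape
    # Pair every pixel with its neighbor at (dr, dc).
    a = patch[:rows - dr, :cols - dc].ravel()
    b = patch[dr:, dc:].ravel()
    m = np.zeros((levels, levels), dtype=float)
    np.add.at(m, (a, b), 1.0)
    m += m.T  # count each pair in both directions
    total = m.sum()
    return m / total if total > 0 else m

def haralick_subset(p):
    """Four of the fourteen Haralick features (illustrative subset)."""
    i, j = np.indices(p.shape)
    asm = np.sum(p ** 2)                            # angular second moment
    contrast = np.sum(p * (i - j) ** 2)
    idm = np.sum(p / (1.0 + (i - j) ** 2))          # inverse difference moment
    nz = p[p > 0]
    entropy = -np.sum(nz * np.log2(nz))
    return np.array([asm, contrast, idm, entropy])

def kernel_features(image, k=8, levels=8):
    """Map each non-overlapping k x k kernel's top-left corner to its features."""
    image = np.asarray(image)
    # Quantize 8-bit intensities down to `levels` gray levels (assumed input range 0..255).
    q = (image.astype(np.int64) * levels) // 256
    feats = {}
    for r in range(0, image.shape[0] - k + 1, k):
        for c in range(0, image.shape[1] - k + 1, k):
            feats[(r, c)] = haralick_subset(glcm(q[r:r + k, c:c + k], levels))
    return feats
```

In a pipeline like the one described, the feature vector for each kernel would be paired with a noise/text label to build the dictionary used for kernel comparison; how that comparison is carried out is not specified here, so it is left out of the sketch.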