Recognizing handwriting in images is challenging because writing style varies widely across writers and each written language has its own linguistic characteristics. In Vietnamese, in addition to the modern Latin characters, there are accent and letter marks that, together with the characters themselves, confuse state-of-the-art handwriting recognition methods. Moreover, because Vietnamese is a low-resource language, few datasets exist for handwriting recognition research, which raises a barrier for researchers entering this area. Recent works evaluated offline handwriting recognition methods in Vietnamese using images generated from an online handwriting dataset by connecting pen stroke coordinates without further processing. This approach cannot measure the ability of recognition methods effectively, because the resulting images are trivial and may lack features that are essential in offline handwriting images. Therefore, in this paper, we propose the Transferring method to construct a handwriting image dataset that incorporates the crucial natural attributes of offline handwriting images. Using our method, we provide the first high-quality synthetic dataset that is complex and natural enough to evaluate handwriting recognition methods effectively. In addition, we conduct experiments with various state-of-the-art methods to identify the challenges that remain in solving handwriting recognition in Vietnamese.
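For context, the baseline that the abstract criticizes, drawing offline images by simply connecting pen stroke coordinates, can be sketched as follows. This is only an illustrative sketch of that naive rendering, not the Transferring method proposed in the paper. The function names `draw_line` and `render_strokes`, the integer pixel coordinates, and the binary canvas are all assumptions made for the example; real online handwriting datasets may store strokes differently.

```python
from typing import List, Sequence, Tuple

Point = Tuple[int, int]  # (x, y) pixel coordinates; assumed to lie inside the canvas


def draw_line(canvas: List[List[int]], p0: Point, p1: Point) -> None:
    """Mark every pixel on the segment p0 -> p1 using Bresenham's line algorithm."""
    (x0, y0), (x1, y1) = p0, p1
    dx, dy = abs(x1 - x0), -abs(y1 - y0)
    sx = 1 if x0 < x1 else -1
    sy = 1 if y0 < y1 else -1
    err = dx + dy
    while True:
        canvas[y0][x0] = 1
        if (x0, y0) == (x1, y1):
            break
        e2 = 2 * err
        if e2 >= dy:  # step along x
            err += dy
            x0 += sx
        if e2 <= dx:  # step along y
            err += dx
            y0 += sy


def render_strokes(
    strokes: Sequence[Sequence[Point]], width: int, height: int
) -> List[List[int]]:
    """Naive online-to-offline rendering: connect consecutive pen points.

    Hypothetical helper for illustration. Each stroke is a list of points
    recorded while the pen is down; strokes are not connected to each other.
    """
    canvas = [[0] * width for _ in range(height)]
    for stroke in strokes:
        if len(stroke) == 1:  # a single dot, e.g. part of a diacritic
            x, y = stroke[0]
            canvas[y][x] = 1
        for p0, p1 in zip(stroke, stroke[1:]):
            draw_line(canvas, p0, p1)
    return canvas


if __name__ == "__main__":
    # Two toy strokes: a horizontal bar and a short diagonal.
    image = render_strokes([[(0, 0), (3, 0)], [(0, 2), (2, 4)]], width=5, height=5)
    for row in image:
        print("".join("#" if pixel else "." for pixel in row))
```

The image produced by this sketch has uniform one-pixel strokes on a clean background, with no variation in stroke width, ink, or paper. That is consistent with the abstract's point that images rendered this way may lack features found in real offline handwriting.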