Diacritic characters form a distinctive set of characters that provides a strong clue for identifying a language with considerably high accuracy. Although diacritics are associated with phonetics, they often serve as a distinguishing feature for many languages, especially those written in a Latin script. In this work, we aim to identify the language of text in images from the presence of diacritic characters, in order to improve Optical Character Recognition (OCR) performance in an automated environment. We demonstrate our approach across 13 Latin-script languages encompassing 85 diacritic characters. We use an architecture similar to SqueezeDet for object detection of diacritic characters, followed by a shallow network that identifies the language. OCR systems that are given the identified language as a parameter tend to produce better results than OCR systems deployed alone. Apart from improving OCR results, the proposed work also takes on-device (mobile phone) constraints into consideration in terms of model size and inference time.
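To make the underlying intuition concrete, the following is a minimal sketch of how the presence of diacritics alone can point to a language. It is not the method described above: the detector and the shallow network are both replaced here by a plain lookup table and a voting rule, and the input is ordinary text rather than an image. The table `DIACRITIC_LANGUAGES`, the helper names, and the five-language subset are all hypothetical and deliberately incomplete; they do not reproduce the 13 languages or 85 characters used in this work.

```python
import unicodedata
from collections import Counter

# Hypothetical, deliberately incomplete table (an assumption for illustration):
# a few diacritic characters and some of the languages that use them.
DIACRITIC_LANGUAGES = {
    "ß": {"German"},
    "ä": {"German", "Swedish"},
    "ö": {"German", "Swedish"},
    "ü": {"German", "Spanish"},
    "å": {"Swedish"},
    "ñ": {"Spanish"},
    "ã": {"Portuguese"},
    "õ": {"Portuguese"},
    "ç": {"French", "Portuguese"},
    "ł": {"Polish"},
    "ř": {"Czech"},
}


def extract_diacritics(text):
    """Stand-in for the detection stage: return the known diacritics in text."""
    # NFC composes base letter + combining mark into a single code point,
    # so that "a" + U+0308 matches the table entry "ä".
    composed = unicodedata.normalize("NFC", text)
    return [ch for ch in composed if ch.lower() in DIACRITIC_LANGUAGES]


def score_languages(detected):
    """Each detected character splits one vote evenly among its languages."""
    scores = Counter()
    for ch in detected:
        langs = DIACRITIC_LANGUAGES.get(ch.lower(), ())
        for lang in langs:
            scores[lang] += 1.0 / len(langs)
    return scores


def identify_language(detected):
    """Stand-in for the classification stage: pick the highest-scoring language."""
    scores = score_languages(detected)
    if not scores:
        return None  # no known diacritics, so no evidence either way
    # Sorting first makes ties resolve alphabetically and deterministically.
    return max(sorted(scores), key=scores.get)


if __name__ == "__main__":
    for sample in ["Straße und Mädchen", "coração e ações", "hello world"]:
        print(sample, "->", identify_language(extract_diacritics(sample)))
```

A character that occurs in only one language of the table (such as "ß") contributes a full vote, while a shared character (such as "ç") contributes a fraction to each candidate. This is one simple way to express why exclusive diacritics are the stronger clue. The identified language could then be passed to an OCR engine as its language parameter.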