We present the largest publicly available synthetic OCR benchmark dataset for Indic languages. The collection contains a total of 90k images and their ground truth, covering 23 Indic languages. Validating OCR models for Indic languages requires a large amount of diverse data in order to produce robust and reliable models. Collecting such a volume of data would otherwise be difficult, but with synthetic generation it becomes far easier. This is of great value to fields such as Computer Vision and Image Processing, where model development becomes easier once an initial synthetic dataset is available. Synthetic generation also offers the flexibility to adjust the nature and environment of the data as and when required to improve model performance. Accurately labelling real-world data is often expensive, whereas accurate ground truth for synthetic data can be obtained easily, since the labels are known at generation time.
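To illustrate the core idea behind synthetic OCR data of this kind, the following minimal Python sketch renders a ground-truth string into an image using an Indic-script font. It is not the authors' actual pipeline; the font file, sample text, and output name are placeholder assumptions, and a real generator would additionally vary fonts, backgrounds, and distortions.

```python
# Minimal sketch of synthetic OCR sample generation (illustrative only,
# not the dataset's actual pipeline). Requires Pillow and a Unicode-capable
# font file; complex-script shaping may need Pillow built with libraqm.
from PIL import Image, ImageDraw, ImageFont

def render_text_image(text, font_path, font_size=32, padding=10):
    """Render `text` onto a white canvas and return the image."""
    font = ImageFont.truetype(font_path, font_size)
    # Measure the rendered text to size the canvas.
    left, top, right, bottom = font.getbbox(text)
    width = right - left + 2 * padding
    height = bottom - top + 2 * padding
    image = Image.new("RGB", (width, height), "white")
    draw = ImageDraw.Draw(image)
    draw.text((padding - left, padding - top), text, font=font, fill="black")
    return image

if __name__ == "__main__":
    # Hypothetical Devanagari sample; any suitable .ttf font works.
    img = render_text_image("नमस्ते दुनिया", "NotoSansDevanagari-Regular.ttf")
    img.save("sample_0001.png")
    # The ground-truth label is simply the string that was rendered.
```

Because the text is supplied by the generator, every image is paired with an exact label for free, which is what makes large-scale synthetic benchmarks of this kind feasible.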