We present a novel data generation tool for document processing. The tool focuses on providing the maximum level of visual information for a standard document, with annotations ranging from character-level positions to paragraph-level positions. It also enables work with large datasets in low-resource languages and provides a means of thoroughly processing the full-level information of the document text. The tool comes with a dataset of 320,000 synthetic Vietnamese document images and instructions for generating a dataset of similar size in other languages. The repository can be found at: https://github.com/tson1997/SDL-Document-Image-Generation