This study focuses on improving the optical character recognition (OCR) data for panels in the COMICS dataset, the largest dataset containing text and images from comic books. To do this, we developed a pipeline for OCR processing and labeling of comic books and created the first text detection and recognition datasets for Western comics, called "COMICS Text+: Detection" and "COMICS Text+: Recognition". We evaluated state-of-the-art text detection and recognition models on these datasets and found a significant improvement in word accuracy and normalized edit distance over the original text in COMICS. We also created a new dataset called "COMICS Text+", which contains the text extracted from the textboxes in the COMICS dataset. Using the improved text data of COMICS Text+ in an existing comics processing model resulted in state-of-the-art performance on cloze-style tasks without changing the model architecture. The COMICS Text+ dataset can be a valuable resource for researchers working on text detection, text recognition, and high-level processing of comics, such as narrative understanding, character relations, and story generation. All the data and inference instructions can be accessed at https://github.com/gsoykan/comics_text_plus.
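
The two recognition metrics named above, word accuracy and normalized edit distance, can be illustrated with a minimal sketch. The definitions used here are assumptions, not taken from the paper: word accuracy is the fraction of predictions that match the ground truth exactly, and normalized edit distance is the Levenshtein distance divided by the length of the longer string. Some toolkits report one minus this value instead. The function names are hypothetical.

```python
from typing import Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # deleting the first i characters of a
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_edit_distance(pred: str, truth: str) -> float:
    """Edit distance scaled to [0, 1] by the longer string (assumed convention)."""
    longest = max(len(pred), len(truth))
    return 0.0 if longest == 0 else edit_distance(pred, truth) / longest


def word_accuracy(preds: Sequence[str], truths: Sequence[str]) -> float:
    """Fraction of predictions that match the ground truth exactly."""
    if len(preds) != len(truths):
        raise ValueError("preds and truths must have the same length")
    if not truths:
        return 0.0
    return sum(p == t for p, t in zip(preds, truths)) / len(truths)
```

Under these assumed definitions, a lower normalized edit distance and a higher word accuracy both indicate OCR output closer to the ground truth.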