A Comprehensive Gold Standard and Benchmark for Comics Text Detection and Recognition

This study focuses on improving the optical character recognition (OCR) data for panels in the COMICS dataset, the largest dataset containing text and images from comic books. To do this, we developed a pipeline for OCR processing and labeling of comic books and created the first text detection and recognition datasets for Western comics, called "COMICS Text+: Detection" and "COMICS Text+: Recognition". We evaluated the performance of state-of-the-art text detection and recognition models on these datasets and found significant improvement in word accuracy and normalized edit distance compared to the text in COMICS. We also created a new dataset called "COMICS Text+", which contains the extracted text from the textboxes in the COMICS dataset. Using the improved text data of COMICS Text+ in an existing comics processing model resulted in state-of-the-art performance on cloze-style tasks without changing the model architecture. The COMICS Text+ dataset can be a valuable resource for researchers working on tasks including text detection, recognition, and high-level processing of comics, such as narrative understanding, character relations, and story generation. All the data and inference instructions can be accessed at https://github.com/gsoykan/comics_text_plus.
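The two recognition metrics named in the abstract, word accuracy and normalized edit distance, can be illustrated with a minimal sketch. The abstract does not give the exact definitions used, so this sketch assumes a common convention: word accuracy is the fraction of predictions that match their reference exactly, and normalized edit distance is the Levenshtein distance divided by the length of the longer string, averaged over all pairs. The function names are hypothetical and are not taken from the COMICS Text+ repository.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def _check(predictions: Sequence[str], references: Sequence[str]) -> None:
    if len(predictions) != len(references) or not references:
        raise ValueError("need two non-empty sequences of equal length")


def word_accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of predictions that equal their reference exactly (assumed definition)."""
    _check(predictions, references)
    matches = sum(p == r for p, r in zip(predictions, references))
    return matches / len(references)


def mean_normalized_edit_distance(predictions: Sequence[str],
                                  references: Sequence[str]) -> float:
    """Mean of edit distance / max(len) per pair (assumed normalization)."""
    _check(predictions, references)
    total = 0.0
    for p, r in zip(predictions, references):
        longest = max(len(p), len(r))
        # Two empty strings are identical, so they contribute zero distance.
        total += levenshtein(p, r) / longest if longest else 0.0
    return total / len(references)
```

Under these assumed definitions, higher word accuracy and lower normalized edit distance both indicate OCR output closer to the reference transcription. Some benchmarks report one minus the normalized distance instead, in which case higher is better.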

