Most existing scene text detectors focus on detecting characters or words, which capture only partial text messages because contextual information is missing. For a better understanding of text in scenes, it is more desirable to detect contextual text blocks (CTBs), each of which consists of one or more integral text units (e.g., characters, words, or phrases) in natural reading order and conveys a complete text message. This paper presents contextual text detection, a new setup that detects CTBs for better understanding of text in scenes. We formulate the new setup as a dual detection task that first detects integral text units and then groups them into CTBs. To this end, we design a novel scene text clustering technique that treats integral text units as tokens and groups those belonging to the same CTB into an ordered token sequence. In addition, we create two datasets, SCUT-CTW-Context and ReCTS-Context, to facilitate future research, where each CTB is annotated with an ordered sequence of integral text units. Further, we introduce three metrics that measure contextual text detection in terms of local accuracy, continuity, and global accuracy. Extensive experiments show that our method accurately detects CTBs, which effectively facilitates downstream tasks such as text classification and translation. The project is available at https://sg-vilab.github.io/publication/xue2022contextual/.
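To make the output format of the dual detection task concrete, the following is a minimal illustrative sketch, not the paper's clustering technique. It assumes that integral text units have already been detected and that some model has predicted, for each unit, which unit is read next. The names `TextUnit`, `group_into_ctbs`, and `next_unit` are hypothetical and do not come from the paper or its code release.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class TextUnit:
    """One integral text unit (character, word, or phrase); hypothetical type."""
    unit_id: int
    box: Tuple[float, float, float, float]  # assumed (x_min, y_min, x_max, y_max)
    text: str = ""


def group_into_ctbs(
    units: List[TextUnit],
    next_unit: Dict[int, Optional[int]],
) -> List[List[int]]:
    """Chain units into ordered id sequences, one sequence per block.

    next_unit maps a unit id to the id of the unit read right after it,
    or to None when the unit ends its block. These links are an assumed
    input; how they are predicted is outside this sketch.
    """
    # Units that are someone's successor cannot start a block.
    has_predecessor = {nxt for nxt in next_unit.values() if nxt is not None}
    blocks: List[List[int]] = []
    for unit in units:
        if unit.unit_id in has_predecessor:
            continue
        chain: List[int] = []
        seen = set()
        current: Optional[int] = unit.unit_id
        # Follow successor links; the seen set guards against cyclic links.
        while current is not None and current not in seen:
            chain.append(current)
            seen.add(current)
            current = next_unit.get(current)
        blocks.append(chain)
    return blocks


units = [
    TextUnit(0, (0, 0, 10, 10), "NO"),
    TextUnit(1, (12, 0, 30, 10), "PARKING"),
    TextUnit(2, (0, 50, 20, 60), "EXIT"),
]
links = {0: 1, 1: None, 2: None}
print(group_into_ctbs(units, links))  # [[0, 1], [2]]
```

Each inner list is one block expressed as an ordered sequence of unit ids, which mirrors how the abstract describes a CTB: an ordered sequence of integral text units. Units that form a closed loop of links have no starting unit and are left out by this simplified sketch.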