Recent video text spotting methods usually require a three-stage pipeline: detecting text in individual frames, recognizing the localized text, and tracking text streams with post-processing to generate the final results. These methods typically follow the tracking-by-matching paradigm and develop sophisticated pipelines. In this paper, rooted in Transformer sequence modeling, we propose a simple but effective end-to-end video text DEtection, Tracking, and Recognition framework (TransDETR). TransDETR offers two main advantages: 1) Unlike the explicit matching paradigm between adjacent frames, TransDETR tracks and recognizes each text instance implicitly with a distinct query, termed the text query, over a long-range temporal sequence (more than 7 frames). 2) TransDETR is the first end-to-end trainable video text spotting framework, which simultaneously addresses the three sub-tasks (i.e., text detection, tracking, and recognition). Extensive experiments on four video text datasets (i.e., ICDAR2013 Video, ICDAR2015 Video, Minetto, and YouTube Video Text) demonstrate that TransDETR achieves state-of-the-art performance, with up to around 8.0% improvement on the video text spotting task. The code of TransDETR is available at https://github.com/weijiawu/TransDETR.
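The core idea of tracking by persistent queries can be illustrated with a minimal sketch (this is NOT the authors' implementation; the dimensions, single-head attention, and all names are illustrative assumptions): a fixed set of text queries cross-attends to each frame's features, and because each query is carried over to the next frame, query *i* implicitly refers to the same text instance throughout the clip, so no explicit frame-to-frame matching step is needed.

```python
import numpy as np

# Conceptual sketch of query-based implicit tracking (illustrative only).
rng = np.random.default_rng(0)
D = 32          # feature dimension (assumed)
N_QUERIES = 4   # number of text queries (assumed)
N_TOKENS = 16   # flattened per-frame feature tokens (assumed)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, frame_feats):
    """One round of query -> frame-feature attention (single head)."""
    scores = queries @ frame_feats.T / np.sqrt(D)   # (N_QUERIES, N_TOKENS)
    attn = softmax(scores, axis=-1)
    return attn @ frame_feats                       # updated query states

# Simulate a short clip: the SAME queries persist across frames, so each
# query accumulates the history of one text instance (implicit tracking).
queries = rng.normal(size=(N_QUERIES, D))
track_states = []
for frame in range(7):  # long-range sequence, matching the >7 frames noted above
    frame_feats = rng.normal(size=(N_TOKENS, D))    # stand-in for backbone features
    queries = cross_attention(queries, frame_feats)
    track_states.append(queries.copy())

print(len(track_states), track_states[0].shape)     # 7 (4, 32)
```

In a full model, each per-frame query state would additionally feed detection (box regression) and recognition (character decoding) heads, which is what allows the three sub-tasks to be trained jointly end to end.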