Semantic representation is of great benefit to the video text tracking (VTT) task, which requires simultaneously classifying, detecting, and tracking text in a video. Most existing approaches tackle this task using appearance similarity across consecutive frames while ignoring rich semantic features. In this paper, we explore robust video text tracking through contrastive learning of semantic and visual representations. Accordingly, we present an end-to-end video text tracker with Semantic and Visual Representations (SVRep), which detects and tracks text by exploiting the visual and semantic relationships between different text instances in a video sequence. Moreover, with a lightweight architecture, SVRep achieves state-of-the-art performance while maintaining competitive inference speed. Specifically, with a ResNet-18 backbone, SVRep achieves an ${\rm ID_{F1}}$ of $\textbf{65.9\%}$ on the ICDAR2015 (video) dataset, running at $\textbf{16.7}$ FPS, an improvement of $\textbf{8.6\%}$ over previous state-of-the-art methods.
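The abstract names contrastive learning of semantic and visual representations but does not specify the objective. As a rough illustration only, a symmetric InfoNCE-style loss between paired visual and semantic embeddings of the same text instances might be sketched as follows (the function name, the NumPy formulation, and the temperature value are assumptions for illustration, not the paper's actual loss):

```python
import numpy as np

def contrastive_loss(visual, semantic, temperature=0.1):
    """Symmetric InfoNCE-style loss: row i of `visual` and row i of
    `semantic` are embeddings of the same text instance; all other
    rows serve as negatives. Illustrative sketch, not the paper's loss."""
    # L2-normalise both embedding sets so the dot product is cosine similarity
    v = visual / np.linalg.norm(visual, axis=1, keepdims=True)
    s = semantic / np.linalg.norm(semantic, axis=1, keepdims=True)
    logits = v @ s.T / temperature          # (N, N) similarity matrix
    idx = np.arange(len(v))                 # matching pairs lie on the diagonal

    def cross_entropy(lg):
        # log-softmax over each row, then pick the diagonal (positive pair)
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[idx, idx].mean()

    # average of visual->semantic and semantic->visual directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

Under such an objective, embeddings of the same text instance are pulled together across the two modalities, while embeddings of different instances are pushed apart.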