The extraction of text information from videos is a critical step toward semantic understanding of videos. It usually involves two steps: (1) text recognition and (2) text classification. To localize text in videos, we can resort to a large number of text recognition methods based on OCR technology. However, to our knowledge, no existing work focuses on the second step of video text classification, which limits the guidance such text can provide to downstream tasks such as video indexing and browsing. In this paper, we are the first to address this new task of video text classification by fusing multimodal information to handle the challenging scenario where different types of video text may be confused due to their varied colors, unknown fonts, and complex layouts. In addition, we tailor a specific module called CorrelationNet to reinforce feature representation by explicitly extracting layout information. Furthermore, contrastive learning is utilized to explore the inherent connections between samples using abundant unlabeled videos. Finally, we construct a new, well-defined industrial dataset from the news domain, called TI-News, which is dedicated to building and evaluating video text recognition and classification applications. Extensive experiments on TI-News demonstrate the effectiveness of our method.
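To make the contrastive-learning component concrete, the following is a minimal illustrative sketch (not the paper's actual implementation) of a standard NT-Xent style contrastive loss, the kind of objective commonly used to pull together embeddings of two augmented views of the same unlabeled sample; the function name and tensor shapes are hypothetical choices for this example.

```python
# Illustrative sketch: NT-Xent contrastive loss over two augmented views.
# Assumes z1 and z2 are (N, D) embeddings of the same N unlabeled samples.
import torch
import torch.nn.functional as F

def nt_xent_loss(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    n = z1.size(0)
    # Concatenate and L2-normalize so the dot product is cosine similarity.
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)          # (2N, D)
    sim = z @ z.t() / temperature                               # (2N, 2N) similarity logits
    sim.fill_diagonal_(float("-inf"))                           # mask self-similarity
    # The positive for sample i is its other view at index i + N (and vice versa).
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)

# Usage example with random embeddings standing in for features of two views.
z1, z2 = torch.randn(8, 128), torch.randn(8, 128)
print(nt_xent_loss(z1, z2).item())
```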