Most existing video text spotting benchmarks focus on evaluating a single language and scenario with limited data. In this work, we introduce a large-scale, Bilingual, Open World Video text benchmark dataset (BOVText). BOVText has four features. Firstly, we provide more than 2,000 videos with over 1,750,000 frames, 25 times larger than the existing largest dataset with incidental text in videos. Secondly, our dataset covers 30+ open categories with a wide selection of scenarios, e.g., Life Vlog, Driving, Movie, etc. Thirdly, abundant text type annotations (i.e., title, caption, or scene text) are provided for the different representational meanings of text in video. Fourthly, BOVText provides bilingual text annotations to promote communication across multiple cultures. Besides, we propose an end-to-end video text spotting framework with Transformer, termed TransVTSpotter, which addresses multi-oriented text spotting in video with a simple but efficient attention-based query-key mechanism. It applies object features from the previous frame as tracking queries for the current frame and introduces a rotation angle prediction to fit multi-oriented text instances. On ICDAR2015 (video), TransVTSpotter achieves state-of-the-art performance with 44.1% MOTA at 9 fps. The dataset and the code of TransVTSpotter can be found at github.com/weijiawu/BOVText and github.com/weijiawu/TransVTSpotter, respectively.
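The query-key tracking idea mentioned above can be sketched in a minimal, dependency-free form. This is an illustrative toy, not the paper's implementation: the feature dimensions, the single-head attention, and the linear `angle_head` are all simplifying assumptions. Object features from the previous frame act as attention queries over the current frame's features, and a small head predicts a rotation angle per tracked instance.

```python
import math
import random

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attend(queries, keys, values, d):
    """Scaled dot-product attention: each query attends over keys/values."""
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, values)) for j in range(d)])
    return out

def track_step(prev_object_feats, frame_feats, angle_head):
    """One tracking step: previous-frame object features serve as tracking
    queries over current-frame features; an angle head predicts rotation."""
    d = len(frame_feats[0])
    updated = attend(prev_object_feats, frame_feats, frame_feats, d)
    angles = [angle_head(f) for f in updated]
    return updated, angles

# Toy linear angle head (assumed), squashed into (-pi/2, pi/2) via atan.
random.seed(0)
d = 4
w = [random.uniform(-1, 1) for _ in range(d)]
angle_head = lambda f: math.atan(sum(wi * fi for wi, fi in zip(w, f)))

prev = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.1, 0.0, 0.2]]   # 2 tracked text instances
frame = [[0.2, 0.1, 0.4, 0.3], [0.0, 0.5, 0.1, 0.2], [0.3, 0.3, 0.3, 0.3]]
feats, angles = track_step(prev, frame, angle_head)
print(len(feats), len(angles))  # one updated feature and one angle per instance
```

Because each tracked instance keeps its own query across frames, identity is preserved implicitly by the attention mechanism rather than by a separate association step; the real model replaces the toy heads here with learned Transformer layers.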