Scene text retrieval has made significant progress with the assistance of accurate text localization. However, existing approaches typically require costly bounding box annotations for training. Moreover, they mostly adopt customized retrieval strategies and struggle to unify various types of queries to meet diverse retrieval needs. To address these issues, we introduce Multi-query Scene Text retrieval with Attention Recycling (MSTAR), a box-free approach for scene text retrieval. It incorporates progressive vision embedding to dynamically capture multi-grained representations of text and harmonizes free-style text queries with style-aware instructions. Additionally, a multi-instance matching module is integrated to enhance vision-language alignment. Furthermore, we build the Multi-Query Text Retrieval (MQTR) dataset, the first benchmark designed to evaluate the multi-query scene text retrieval capability of models, comprising four query types and 16k images. Extensive experiments demonstrate the superiority of our method across seven public datasets and the MQTR dataset. Notably, MSTAR surpasses the previous state-of-the-art model by 6.4% in MAP on Total-Text while eliminating box annotation costs. Moreover, on the MQTR benchmark, MSTAR significantly outperforms previous models by an average of 8.5%. The code and datasets are available at https://github.com/yingift/MSTAR.