Scene text recognition (STR) is very challenging due to the diversity of text instances and the complexity of scenes. The community has paid increasing attention to boosting performance by improving the image pre-processing modules, such as rectification and deblurring, or the sequence translator. However, another critical module, i.e., the feature sequence extractor, has not been extensively explored. In this work, inspired by the success of neural architecture search (NAS), which can identify architectures better than human-designed ones, we propose automated STR (AutoSTR), which searches for data-dependent backbones to boost text recognition performance. First, we design a domain-specific search space for STR that contains both choices of operations and constraints on the downsampling path. Then, we propose a two-step search algorithm that decouples the operations from the downsampling path for an efficient search in the given space. Experiments demonstrate that, by searching data-dependent backbones, AutoSTR can outperform state-of-the-art approaches on standard benchmarks with far fewer FLOPS and model parameters.
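To make the two-step idea concrete, the sketch below decouples the search as described: it first selects a downsampling path under a constraint on how often the feature map height may be halved, and then chooses per-layer operations for that fixed path. This is a minimal illustration only; the operation set, the constraint, the naive enumeration plus greedy per-layer selection, and the `evaluate_recognition_accuracy` proxy are all hypothetical placeholders, not the paper's actual search space or search procedure.

```python
# Minimal illustrative sketch of a two-step backbone search (NOT the authors'
# implementation): step 1 fixes the downsampling path, step 2 fixes operations.
import itertools
import random

OPERATIONS = ["conv3x3", "conv5x5", "dw_conv3x3", "skip"]   # hypothetical candidate ops per layer
STRIDES = [(1, 1), (2, 1), (2, 2)]                          # (height, width) stride choices per layer
NUM_LAYERS = 5
MAX_HEIGHT_DOWNSAMPLING = 3  # example constraint: at most 3 layers may halve the feature height


def evaluate_recognition_accuracy(path, ops):
    """Hypothetical proxy: train a candidate backbone with this configuration
    and return its validation accuracy. Replaced here by a random score."""
    return random.random()


def search_downsampling_path():
    """Step 1: enumerate stride assignments that satisfy the path constraint,
    keeping the best one under a fixed default operation."""
    best_path, best_score = None, -1.0
    for path in itertools.product(STRIDES, repeat=NUM_LAYERS):
        if sum(s[0] == 2 for s in path) > MAX_HEIGHT_DOWNSAMPLING:
            continue  # violates the downsampling-path constraint
        score = evaluate_recognition_accuracy(path, ["conv3x3"] * NUM_LAYERS)
        if score > best_score:
            best_path, best_score = path, score
    return best_path


def search_operations(path):
    """Step 2: with the downsampling path fixed, greedily pick one operation per layer."""
    ops = ["conv3x3"] * NUM_LAYERS
    for layer in range(NUM_LAYERS):
        scores = {
            op: evaluate_recognition_accuracy(path, ops[:layer] + [op] + ops[layer + 1:])
            for op in OPERATIONS
        }
        ops[layer] = max(scores, key=scores.get)
    return ops


if __name__ == "__main__":
    path = search_downsampling_path()
    print("downsampling path:", path)
    print("operations:", search_operations(path))
```

The point of the decoupling is efficiency: instead of scoring every (path, operation) combination jointly, the path and the per-layer operations are searched in two much smaller stages.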