The theory of convolutional neural networks suggests that they are shift equivariant, i.e., that a shifted input causes an equally shifted output. In practice, however, this is not always the case. This poses a serious problem for scene text detection, for which a consistent spatial response is crucial, irrespective of the position of the text in the scene. Using a simple synthetic experiment, we demonstrate the inherent shift variance of a state-of-the-art fully convolutional text detector. Furthermore, using the same experimental setting, we show how small architectural changes can lead to improved shift equivariance and less variation in the detector output. We validate the synthetic results using a real-world training schedule on the text detection network. To quantify the amount of shift variability, we propose a metric based on well-established text detection benchmarks. While the proposed architectural changes are not able to fully recover shift equivariance, adding smoothing filters can substantially improve shift consistency on common text datasets. Considering the potentially large impact of small shifts, we propose extending the commonly used text detection metrics with the metric described in this work, so that the consistency of text detectors can be quantified.
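The effect described above can be illustrated with a minimal one-dimensional sketch. It is a toy under stated assumptions, not the detector, the experiment, or the metric from this work: a strided downsampling step stands in for the network, a circular [1, 2, 1] / 4 filter stands in for the smoothing filters, and the hypothetical helper `response_spread` measures how much the summed output changes as a single impulse is shifted across all positions.

```python
def shift(signal, k):
    """Circularly shift a list to the right by k positions."""
    k %= len(signal)
    return signal[-k:] + signal[:-k] if k else list(signal)


def blur(signal):
    """Apply a circular [1, 2, 1] / 4 smoothing filter (illustrative choice)."""
    n = len(signal)
    return [
        0.25 * signal[(i - 1) % n] + 0.5 * signal[i] + 0.25 * signal[(i + 1) % n]
        for i in range(n)
    ]


def downsample(signal, stride=2):
    """Keep every stride-th sample, as a strided layer would."""
    return signal[::stride]


def response_spread(layer, signal):
    """Hypothetical toy measure: max minus min of the summed layer output
    over all circular shifts of the input. Zero means the total response
    does not depend on where the input sits."""
    totals = [sum(layer(shift(signal, k))) for k in range(len(signal))]
    return max(totals) - min(totals)


# A single impulse plays the role of a small text instance.
impulse = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]

plain_spread = response_spread(downsample, impulse)
smoothed_spread = response_spread(lambda s: downsample(blur(s)), impulse)

print(plain_spread, smoothed_spread)  # prints: 1.0 0.0
```

Without smoothing, the impulse is either kept or dropped entirely depending on whether it lands on an even or odd position, so the total response swings between 0 and 1. With the smoothing filter applied first, the total response is the same at every shift. This exact cancellation is a property of the linear toy setup only; as stated above, in the real detector smoothing improves shift consistency but does not fully recover equivariance.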