We study a novel multimodal-learning problem, which we call text matching: given an image containing a single line of text and a candidate transcription, the goal is to assess whether the text depicted in the image corresponds to the candidate text. We devise the first machine-learning model specifically designed for this problem. The proposed model, termed TextMatcher, compares the two inputs by applying a cross-attention mechanism over the embedding representations of image and text, and it is trained in an end-to-end fashion. We extensively evaluate the empirical performance of TextMatcher on the popular IAM dataset. Results show that, compared to a baseline and to existing models designed for related problems, TextMatcher achieves higher performance across a variety of configurations, while also running faster at inference time. We further showcase TextMatcher in a real-world application scenario concerning the automatic processing of bank cheques.
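The abstract does not detail the architecture; as a purely illustrative sketch (not the authors' implementation), generic scaled dot-product cross-attention, in which text-character embeddings act as queries over image-column features, could be written as follows. All shapes and names here are hypothetical.

```python
import numpy as np

def cross_attention(text_emb, image_emb):
    """Text embeddings (queries) attend over image features (keys/values).

    text_emb:  (T_text, d) array of character/token embeddings
    image_emb: (T_img, d)  array of image-column features
    Returns:   (T_text, d) image features aggregated per text position.
    """
    d = text_emb.shape[-1]
    scores = text_emb @ image_emb.T / np.sqrt(d)   # (T_text, T_img) similarities
    scores -= scores.max(axis=-1, keepdims=True)   # stabilize softmax numerically
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # softmax over image positions
    return weights @ image_emb                     # attended image representation

# Hypothetical sizes: 5 characters, 20 image columns, 64-dim embeddings.
rng = np.random.default_rng(0)
text_emb = rng.standard_normal((5, 64))
image_emb = rng.standard_normal((20, 64))
attended = cross_attention(text_emb, image_emb)
print(attended.shape)  # (5, 64)
```

In a matching model of this kind, the attended representation would then be compared against the text embeddings to produce a match score; the actual TextMatcher design should be taken from the paper itself.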