The evaluation of Handwritten Text Recognition (HTR) models during their development is straightforward: because HTR is a supervised problem, the usual split into training, validation, and test data sets allows models to be evaluated in terms of accuracy or error rates. However, the evaluation process becomes tricky as soon as we switch from development to application. Compiling a new (and necessarily smaller) ground truth (GT) from a sample of the data to which we want to apply the model, and then evaluating models on it, only provides hints about the quality of the recognised text, as do the confidence scores the models return (if available). Moreover, if we have several models at hand, we face a model selection problem, since we want to obtain the best possible result during the application phase. This calls for GT-free metrics to select the best model, which is why we (re-)introduce and compare different metrics, ranging from simple, lexicon-based ones to more elaborate ones using standard language models and masked language models (MLM). We show that MLM-based evaluation can compete with lexicon-based methods, with the advantage that large and multilingual transformers are readily available, thus making the compilation of lexical resources for other metrics superfluous.
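To make the MLM-based scoring idea concrete, the following is a minimal sketch (our own illustration, not the paper's implementation) of how a masked language model can assign a GT-free quality score to a recognised line of text: each token is masked in turn and the model's log-probability of the original token is accumulated (a pseudo-log-likelihood). The checkpoint name, the candidate transcriptions, and the selection loop are illustrative assumptions.

```python
# Sketch: GT-free scoring of HTR output with a masked language model.
# Assumes the Hugging Face transformers library; the model name and the
# example transcriptions are hypothetical.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_name = "bert-base-multilingual-cased"  # assumed multilingual MLM checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)
model.eval()

def pseudo_log_likelihood(sentence: str) -> float:
    """Average log-probability of each token when it is masked out."""
    enc = tokenizer(sentence, return_tensors="pt")
    input_ids = enc["input_ids"][0]
    total, count = 0.0, 0
    for i in range(1, input_ids.size(0) - 1):  # skip [CLS] and [SEP]
        masked = input_ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits
        log_probs = torch.log_softmax(logits[0, i], dim=-1)
        total += log_probs[input_ids[i]].item()
        count += 1
    return total / max(count, 1)

# Hypothetical model selection: pick the HTR model whose output the MLM
# considers most plausible, without any ground truth.
candidates = {
    "htr_model_A": "Dies ist ein Beispielsatz.",
    "htr_model_B": "Dles ist eln Beispielsatz.",
}
best = max(candidates, key=lambda m: pseudo_log_likelihood(candidates[m]))
print(best)
```

Such a score can be computed for the output of every available HTR model on the same (unlabelled) sample, turning model selection into a simple argmax over MLM scores.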