To be robust enough for widespread adoption, document analysis systems involving machine learning models must respond correctly to inputs that fall outside the distribution of the data on which the models were trained. This paper explores the ability of text classifiers trained on standard document classification datasets to generalize to out-of-distribution documents at inference time. We take the Tobacco-3482 and RVL-CDIP datasets as a starting point and generate new out-of-distribution evaluation datasets in order to analyze the generalization performance of models trained on these standard datasets. We find that models trained on the smaller Tobacco-3482 dataset perform poorly on our new out-of-distribution data, while text classification models trained on the larger RVL-CDIP dataset exhibit smaller performance drops.
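As an illustration of the evaluation setup described above (not the paper's actual models or data pipeline), the following minimal sketch trains a text classifier on one dataset and compares its accuracy on an in-distribution test split against a separate out-of-distribution evaluation set. The TF-IDF plus logistic regression pipeline stands in for whatever classifiers are trained in the paper, and the caller is assumed to supply document texts and labels loaded from Tobacco-3482, RVL-CDIP, or the new out-of-distribution sets.

```python
# Sketch of the out-of-distribution evaluation protocol: fit a classifier on
# training documents, then measure the gap between in-distribution (ID) and
# out-of-distribution (OOD) accuracy. Data loading is left to the caller.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline


def evaluate_ood_gap(train_texts, train_labels,
                     id_texts, id_labels,
                     ood_texts, ood_labels):
    """Return in-distribution accuracy, OOD accuracy, and their difference."""
    clf = make_pipeline(TfidfVectorizer(max_features=50_000),
                        LogisticRegression(max_iter=1000))
    clf.fit(train_texts, train_labels)

    id_acc = accuracy_score(id_labels, clf.predict(id_texts))
    ood_acc = accuracy_score(ood_labels, clf.predict(ood_texts))
    return id_acc, ood_acc, id_acc - ood_acc
```

The size of the accuracy gap is the quantity of interest: under this protocol, a model trained on the larger RVL-CDIP dataset would be expected to show a smaller gap than one trained on Tobacco-3482, matching the finding reported above.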