Metadata quality is crucial for digital objects to be discovered through digital library interfaces. In practice, however, the metadata of digital objects often contains incomplete, inconsistent, or incorrect values. We investigate methods to automatically detect, correct, and canonicalize scholarly metadata, using seven key fields of electronic theses and dissertations (ETDs) as a case study. We propose MetaEnhance, a framework that uses state-of-the-art artificial intelligence methods to improve the quality of these fields. To evaluate MetaEnhance, we compiled a metadata quality evaluation benchmark of 500 ETDs by combining subsets sampled using multiple criteria. On this benchmark, the proposed methods achieved nearly perfect F1-scores in detecting errors and F1-scores ranging from 0.85 to 1.00 in correcting errors for five of the seven fields.
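For readers unfamiliar with the reported metric, the following is a minimal sketch of how a per-field error-detection F1-score could be computed from binary labels. It is an illustration under stated assumptions and not the MetaEnhance evaluation code: the function name `detection_f1` and the boolean encoding (`True` meaning "this field value is flagged as erroneous") are hypothetical choices made for this example.

```python
from typing import Sequence, Tuple


def detection_f1(predicted: Sequence[bool], actual: Sequence[bool]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for binary error detection.

    Hypothetical encoding: True means the field value is marked as an error,
    either by the detector (predicted) or by the ground truth (actual).
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")

    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)        # errors correctly flagged
    fp = sum(1 for p, a in pairs if p and not a)    # correct values wrongly flagged
    fn = sum(1 for p, a in pairs if not p and a)    # errors that were missed

    # Guard against division by zero when nothing is flagged or nothing is wrong.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1
```

Applied once per metadata field over the benchmark records, a function of this kind would yield one detection F1-score per field. A correction F1-score could be computed analogously by treating a corrected value that matches the ground truth as a true positive, though the exact matching criterion is an assumption here.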