Annotated data is a prerequisite for applying supervised machine learning methods, and the quality of the annotations is crucial for the result. Especially for cultural heritage collections, which inherently carry many uncertainties, annotating data remains a manual, arduous task that must be carried out by domain experts. Our project started with two already annotated sets of medieval manuscript images; both were incomplete, however, and contained conflicting metadata arising from scholarly and linguistic differences. Our aims were to create (1) a uniform set of descriptive labels for the combined data set, and (2) a high-quality hierarchical classification that can serve as valuable input for supervised machine learning. To reach these goals, we developed a visual analytics system that enables medievalists to combine, regularize, and extend the vocabulary used to describe these data sets. Visual interfaces for word and image embeddings, as well as for co-occurrences of the annotations across the data sets, make it possible to annotate multiple images at the same time, recommend candidate annotation labels, and support composing a hierarchical classification of labels. The system itself implements a semi-supervised method, as it updates its visual representations based on the medievalists' feedback, and a series of usage scenarios documents its value for the target community.
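To make the idea of recommending label candidates from annotation co-occurrences more concrete, the following is a minimal Python sketch. It is not the system's implementation: the function names (`cooccurrence_counts`, `recommend_labels`), the scoring rule (summing raw pair counts), and the example labels are all assumptions chosen for illustration.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(annotations):
    """Count how often each unordered pair of labels appears on the same image.

    `annotations` maps an image id to its list of labels (hypothetical format).
    Pairs are stored as alphabetically sorted tuples.
    """
    counts = Counter()
    for labels in annotations.values():
        for a, b in combinations(sorted(set(labels)), 2):
            counts[(a, b)] += 1
    return counts


def recommend_labels(image_labels, counts, top_k=3):
    """Suggest labels that frequently co-occur with the labels already assigned.

    Assumed scoring rule: a candidate's score is the sum of its co-occurrence
    counts with every label already present on the image.
    """
    present = set(image_labels)
    scores = Counter()
    for (a, b), n in counts.items():
        if a in present and b not in present:
            scores[b] += n
        elif b in present and a not in present:
            scores[a] += n
    return [label for label, _ in scores.most_common(top_k)]


# Hypothetical toy annotations; these are not taken from the manuscript data sets.
annotations = {
    "img1": ["knight", "horse", "sword"],
    "img2": ["knight", "horse"],
    "img3": ["king", "crown"],
    "img4": ["knight", "sword", "shield"],
}
counts = cooccurrence_counts(annotations)
suggestions = recommend_labels(["knight"], counts, top_k=2)
```

In this sketch, an image that so far carries only the label "knight" would receive "horse" and "sword" as candidates, since each co-occurs with "knight" on two images. An interactive system would additionally need to fold the expert's accepted or rejected suggestions back into the counts, which is omitted here.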