This paper elaborates on the notion of uncertainty in the annotation of large text corpora, focusing specifically on (but not limited to) historical languages. Such uncertainty may stem from inherent properties of the language, for example linguistic ambiguity and overlapping categories of linguistic description, but it may also be caused by a lack of annotation expertise. By examining annotation uncertainty in more detail, we identify its sources and deepen our understanding of the nature and different types of uncertainty encountered in daily annotation practice. We also discuss some practical implications of our theoretical findings. Finally, this paper can be seen as an attempt to reconcile the perspectives of the main scientific disciplines involved in corpus projects, linguistics and computer science, in order to develop a unified view and to highlight the potential synergies between them.