When annotators label data, a key metric for quality assurance is inter-annotator agreement (IAA): the extent to which annotators agree on their labels. Though many IAA measures exist for simple categorical and ordinal labeling tasks, relatively little work has considered more complex labeling tasks, such as structured, multi-object, and free-text annotations. Krippendorff's alpha, best known for use with simpler labeling tasks, does have a distance-based formulation with broader applicability, but little work has studied its efficacy and consistency across complex annotation tasks. We investigate the design and evaluation of IAA measures for complex annotation tasks, with evaluation spanning seven diverse tasks: image bounding boxes, image keypoints, text sequence tagging, ranked lists, free text translations, numeric vectors, and syntax trees. We identify the difficulty of interpretability and the complexity of choosing a distance function as key obstacles in applying Krippendorff's alpha generally across these tasks. We propose two novel, more interpretable measures, showing that they yield more consistent agreement values across tasks and annotation distance functions.
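The distance-based formulation of Krippendorff's alpha mentioned above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it computes alpha as 1 − D_o/D_e over a user-supplied distance function, omitting the missing-data bookkeeping and the specialized distances (bounding-box overlap, tree edit distance, etc.) that complex tasks would require. The function name and the nominal distance are illustrative.

```python
from itertools import permutations

def krippendorff_alpha(units, distance):
    """Distance-based Krippendorff's alpha: 1 - D_o / D_e.

    units    -- list of lists; each inner list holds the labels the
                annotators assigned to one item (units with fewer than
                two labels contribute no pairs and are skipped)
    distance -- symmetric function d(a, b) >= 0 with d(a, a) == 0
    """
    pairable = [u for u in units if len(u) >= 2]
    n = sum(len(u) for u in pairable)  # total number of pairable labels

    # Observed disagreement D_o: within-unit pairwise distances,
    # each unit's pair sum normalized by (r_u - 1), averaged over n.
    d_o = sum(
        sum(distance(a, b) for a, b in permutations(u, 2)) / (len(u) - 1)
        for u in pairable
    ) / n

    # Expected disagreement D_e: average distance over all ordered pairs
    # of labels pooled across units, as if labels were shuffled freely.
    values = [v for u in pairable for v in u]
    d_e = sum(distance(a, b) for a, b in permutations(values, 2)) / (n * (n - 1))

    return 1.0 - d_o / d_e

# Nominal (categorical) distance: 0 for a match, 1 otherwise.
nominal = lambda a, b: 0.0 if a == b else 1.0

print(krippendorff_alpha([["x", "x"], ["y", "y"]], nominal))  # perfect agreement -> 1.0
```

Swapping `nominal` for a task-specific distance (e.g., one minus IoU for bounding boxes, or an edit distance for sequences) is what gives the formulation its broader applicability; the abstract's point is that this choice is itself a source of inconsistency.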