Well-annotated datasets have become more important than ever for supervised machine learning (ML), as recent leading studies show. However, the dataset annotation process and its associated human labor costs remain overlooked. In this work, we analyze the relationship between annotation granularity and ML performance in sequence labeling, using clinical records from nursing shift-change handovers. We first study a model derived from textual language features alone, without additional information based on nursing knowledge, and find that this sequence tagger performs well in most categories at this granularity. We then include additional manual annotations by a nurse and find that sequence tagging performance remains nearly the same. Finally, we offer a guideline and reference point to the community, arguing that fine-grained annotation is unnecessary, and even inadvisable, because of its low return on investment. We therefore recommend that researchers and practitioners emphasize other features, such as textual knowledge, as a cost-effective way to improve sequence labeling performance.