Recent work on opinion expression identification (OEI) relies heavily on the quality and scale of manually constructed training corpora, which can be extremely difficult to obtain. Crowdsourcing is one practical solution to this problem, aiming to create a large-scale but quality-unguaranteed corpus. In this work, we investigate Chinese OEI with extremely noisy crowdsourcing annotations, constructing a dataset at very low cost. Following Zhang et al. (2021), we train an annotator-adapter model by treating all annotations as gold-standard with respect to their crowd annotators, and test the model with a synthetic expert, which is a mixture of all annotators. Since this annotator mixture used for testing is never modeled explicitly during training, we propose to generate synthetic training samples with a pertinent mixup strategy so that training and testing become highly consistent. Simulation experiments on our constructed dataset show that crowdsourcing is highly promising for OEI, and that our proposed annotator-mixup can further enhance the crowdsourcing modeling.
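The core idea of annotator-mixup can be illustrated with a minimal sketch. The code below is an assumption-laden illustration, not the paper's implementation: it assumes each crowd annotator is represented by a learned embedding vector (names, shapes, and the uniform expert mixture are all hypothetical), and it synthesizes a training-time "annotator" by convexly interpolating two annotator embeddings, in the spirit of mixup, so that training sees mixtures resembling the synthetic expert used at test time.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: each crowd annotator has a learned embedding vector.
# The dimensions and values here are illustrative only.
num_annotators, dim = 5, 8
annotator_embeddings = rng.normal(size=(num_annotators, dim))

def annotator_mixup(embeddings, alpha=1.0, rng=rng):
    """Synthesize a training-time 'annotator' by convexly mixing two
    randomly chosen annotator embeddings with a Beta-distributed
    coefficient, as in standard mixup."""
    i, j = rng.choice(len(embeddings), size=2, replace=False)
    lam = rng.beta(alpha, alpha)  # mixing coefficient in (0, 1)
    return lam * embeddings[i] + (1.0 - lam) * embeddings[j]

# A synthetic training annotator: a pairwise mixture.
mixed = annotator_mixup(annotator_embeddings)

# One plausible synthetic test-time expert: the mean of all annotators
# (an assumed instantiation of "a mixture of all annotators").
expert = annotator_embeddings.mean(axis=0)
```

Under this sketch, the model would consume `mixed` during training exactly as it consumes a real annotator's embedding, so the expert mixture seen at test time is no longer out-of-distribution.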