Data processing is an important step in many natural language processing tasks. Because the datasets commonly used for named entity recognition contain only a limited number of samples, it is important to obtain additional labeled data efficiently and reliably. A common practice is to exploit large monolingual unlabeled corpora. Another popular technique is data augmentation, which creates synthetic data from the original labeled data. In this work, we investigate the impact of these two methods on performance in three different named entity recognition tasks.