Imbalanced classification problems are extremely common in natural language processing and are typically addressed with a variety of resampling and filtering techniques, which often involve deciding how to select training data or which test examples the model should label. We examine the tradeoffs in model performance involved in choosing training samples and in filtering training and test data in a heavily imbalanced token classification task, and we examine the relationship between the magnitude of these tradeoffs and the base rate of the phenomenon of interest. In experiments on sequence tagging to detect rare phenomena in English and Arabic texts, we find that different methods of selecting training data bring tradeoffs in effectiveness and efficiency. We also see that in highly imbalanced cases, filtering test data using first-pass retrieval models is as important for model performance as selecting training data. The base rate of a rare positive class has a clear effect on the magnitude of the changes in performance caused by the selection of training or test data. As the base rate increases, the differences brought about by those choices decrease.
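To make the notion of base rate concrete, the following minimal sketch computes the token-level base rate of a positive class and shows how it shifts when a first-pass filter discards some sequences before tagging. It is an illustration only, not the procedure used in the experiments: the toy labels, the `keep` decisions, and the function names `base_rate` and `filter_sequences` are all hypothetical.

```python
from typing import List, Sequence


def base_rate(label_seqs: Sequence[Sequence[int]]) -> float:
    """Fraction of tokens labeled positive (1) across all sequences."""
    total = sum(len(seq) for seq in label_seqs)
    positives = sum(sum(seq) for seq in label_seqs)
    return positives / total if total else 0.0


def filter_sequences(
    label_seqs: Sequence[Sequence[int]], keep_flags: Sequence[bool]
) -> List[Sequence[int]]:
    """Keep only the sequences that a first-pass filter marked as worth tagging."""
    return [seq for seq, keep in zip(label_seqs, keep_flags) if keep]


# Toy data (assumption): four 5-token sentences, one positive token overall.
sentences = [
    [0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]

# Hypothetical first-pass decisions: the filter keeps the one sentence that
# contains a positive token plus one false alarm, and drops the other two.
keep = [False, True, True, False]

filtered = filter_sequences(sentences, keep)

print(base_rate(sentences))  # 1 positive out of 20 tokens -> 0.05
print(base_rate(filtered))   # 1 positive out of 10 tokens -> 0.1
```

In this toy case the filter doubles the base rate seen by the tagger without losing the positive token; a filter that dropped the second sentence would instead remove the only positive example, which is the kind of tradeoff the abstract refers to.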