In this report, we describe our ClassBases submissions to a shared task on multilingual protest event detection. For multilingual protest news detection, we participated in subtask-1 (document classification), subtask-2 (sentence classification), and subtask-4 (token classification). In subtask-1, we compare fine-tuning XLM-RoBERTa-base, mLUKE-base, and XLM-RoBERTa-large in a sequence classification setting. We always combine the training data from every provided language to train our multilingual models. We find that larger models tend to perform better and that entity knowledge helps, though at a non-negligible cost. For subtask-2, we submitted only an mLUKE-base system for sentence classification. For subtask-4, we submitted only an XLM-RoBERTa-base system for token classification (sequence labeling). For the task of automatically replicating manually created event datasets, we participated in the track on COVID-related protest events from the New York Times news corpus, building a system that processes the crawled data into a dataset of protest events.