In this report, we describe our ClassBases submissions to a shared task on multilingual protest event detection. For multilingual protest news detection, we participated in subtask-1 (document classification), subtask-2 (sentence classification), and subtask-4 (token classification). In subtask-1, we compare fine-tuning XLM-RoBERTa-base, mLUKE-base, and XLM-RoBERTa-large in a sequence classification setting. We always combine the training data from every provided language to train our multilingual models. We find that larger models tend to perform better and that entity knowledge helps, though at a non-negligible cost. For subtask-2, we submitted only an mLUKE-base system for sentence classification. For subtask-4, we submitted only an XLM-RoBERTa-base system for token classification (sequence labeling). For the task of automatically replicating manually created event datasets, we participated in the track on COVID-related protest events from the New York Times news corpus, building a system that processes the crawled data into a dataset of protest events.