Pre-trained multilingual language models (e.g., mBERT, XLM-RoBERTa) have significantly advanced the state-of-the-art for zero-shot cross-lingual information extraction. These language models ubiquitously rely on word segmentation techniques that break a word into smaller constituent subwords. Therefore, all word labeling tasks (e.g., named entity recognition, event detection) necessitate a pooling strategy that takes the subword representations as input and outputs a representation for the entire word. Taking the task of cross-lingual event detection as a motivating example, we show that the choice of pooling strategy can have a significant impact on the target language performance. For example, the performance varies by up to 16 absolute $f_{1}$ points depending on the pooling strategy when training in English and testing in Arabic on the ACE task. We carry out our analysis with five different pooling strategies across nine languages in diverse multilingual datasets. Across configurations, we find that the canonical strategy of taking just the first subword to represent the entire word is usually sub-optimal. In contrast, we show that attention pooling is robust to language and dataset variations, as it is consistently either the best or close to the optimal strategy. For reproducibility, we make our code available at https://github.com/isi-boston/ed-pooling.
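To make the comparison concrete, the following is a minimal sketch (not the authors' implementation from the linked repository) of two of the pooling strategies discussed above: the canonical first-subword baseline and attention pooling over a word's subword representations. Tensor shapes, module names, and the single learned scoring vector are illustrative assumptions.

```python
import torch
import torch.nn as nn


class AttentionPooling(nn.Module):
    """Pool a word's subword vectors using a learned attention score per subword."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        # One learned scoring vector applied to every subword representation.
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, subword_states: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # subword_states: (num_words, max_subwords, hidden_dim)
        # mask:           (num_words, max_subwords); 1 for real subwords, 0 for padding
        scores = self.score(subword_states).squeeze(-1)          # (num_words, max_subwords)
        scores = scores.masked_fill(mask == 0, float("-inf"))    # ignore padded positions
        weights = torch.softmax(scores, dim=-1).unsqueeze(-1)    # (num_words, max_subwords, 1)
        return (weights * subword_states).sum(dim=1)             # (num_words, hidden_dim)


def first_subword_pooling(subword_states: torch.Tensor) -> torch.Tensor:
    # Canonical baseline: keep only the first subword's vector for each word.
    return subword_states[:, 0, :]


if __name__ == "__main__":
    hidden_dim = 768  # e.g., the hidden size of mBERT / XLM-RoBERTa base
    states = torch.randn(4, 5, hidden_dim)  # 4 words, at most 5 subwords each
    mask = torch.tensor([[1, 1, 0, 0, 0],
                         [1, 1, 1, 1, 0],
                         [1, 0, 0, 0, 0],
                         [1, 1, 1, 0, 0]])
    pooled_attn = AttentionPooling(hidden_dim)(states, mask)
    pooled_first = first_subword_pooling(states)
    print(pooled_attn.shape, pooled_first.shape)  # both torch.Size([4, 768])
```

Either pooled output can then be fed to a word-level classifier for labeling tasks such as event detection; the two strategies differ only in how the subword vectors are combined.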